ECONOMETRICS
An Introduction to Maximum Likelihood Methods

STEFAN VALAVANIS
Assistant Professor of Economics, Harvard University, 1956 to 1958

Edited, from manuscript, by ALFRED H. CONRAD
Assistant Professor of Economics, Harvard University

1959
McGRAW-HILL BOOK COMPANY, INC.
New York  Toronto  London

Published on demand by University Microfilms (University Microfilms Limited, High Wycombe, England; a Xerox Company, Ann Arbor, Michigan, U.S.A.). This is an authorized facsimile of the original book, produced in the 1970s by microfilm-xerography by Xerox University Microfilms, Ann Arbor, Michigan, U.S.A. Digitized by the Internet Archive in 2013: http://archive.org/details/econometricsOOvala

ECONOMICS HANDBOOK SERIES
Seymour E. Harris, Editor
Advisory Committee: Edward H. Chamberlin, Gottfried Haberler, Alvin H. Hansen, Edward S. Mason, and John H. Williams, all of Harvard University.

Burns • Social Security and Public Policy
Duesenberry • Business Cycles and Economic Growth
Hansen • The American Economy
Hansen • A Guide to Keynes
Hansen • Monetary Theory and Fiscal Policy
Harris • International and Interregional Economics
Henderson and Quandt • Microeconomic Theory
Hoover • The Location of Economic Activity
Kindleberger • Economic Development
Lerner • Economics of Employment
Valavanis • Econometrics

ECONOMETRICS. Copyright © 1959 by the McGraw-Hill Book Company, Inc. Printed in the United States of America. All rights reserved. This book, or parts thereof, may not be reproduced in any form without permission of the publishers. Library of Congress Catalog Card Number 58-14363

THE MAPLE PRESS COMPANY, YORK, PA.
Editor's introduction

For years many teachers of economics and other professional economists have felt the need of a series of books on economic subjects which is not filled by the usual textbook, nor by the highly technical treatise.

This present series, published under the general title of Economics Handbook Series, was planned with these needs in mind. Designed first of all for students, the volumes are useful in the ever-growing field of adult education and also are of interest to the informed general reader.

The volumes present a distillate of accepted theory and practice, without the detailed approach of the technical treatise. Each volume is a unit, standing on its own.

The authors are scholars, each writing on an economic subject on which he is an authority. In this series the author's first task was not to make important contributions to knowledge (although many of them do) but so to present his subject matter that his work as a scholar will carry its maximum influence outside as well as inside the classroom. The time has come to redress the balance between the energies spent on the creation of new ideas and on their dissemination. Economic ideas are unproductive if they do not spread beyond the world of scholars. Popularizers without technical competence, unqualified textbook writers, and sometimes even charlatans control too large a part of the market for economic ideas.

In the classroom the Economics Handbook Series will serve, it is hoped, as brief surveys in one-semester courses, as supplementary reading in introductory courses, and in other courses in which the subject is related.

Seymour E. Harris

Editor's preface

The editor welcomes Stefan Valavanis' study of econometrics into the Economics Handbook Series as a unique contribution to econometrics and to the teaching of the subject. Anyone who reads this book will understand the tragedy of the death of Stefan Valavanis.
He was brilliant, imaginative, and a first-class scholar and teacher, and his death is a great loss to the world of ideas.

Professor Valavanis had virtually completed his book just before his departure for Europe in the summer of 1958. But, as is always true of a manuscript left with the publisher, though it was essentially complete much remained to be done. My colleague, Professor Alfred H. Conrad, volunteered to finish the job. Unselfishly he put the final touches on the book, went over the manuscript, checked the mathematics, assumed the responsibility for seeing it through the press, and helped in many other ways. Without his help, the problem of publication would have been a serious one. The publisher and editor are indeed grateful.

This book is an introduction to econometrics, that is, to the techniques by which economic theories are brought into contact with the facts. While not in any sense a "cookbook," its orientation is constantly toward the strategy of economic research. Within the field of econometrics, the book is primarily addressed to the problems of estimation rather than to the testing of hypotheses. It is concerned with estimating, from the insufficient information available, the values or magnitudes of the variables and relationships suggested by economic analysis. The maximum likelihood and limited information techniques are developed from fundamental assumptions and criteria and demonstrated by example; their costs in accuracy and computation are weighed. There are short but careful treatments of identification, instrumental variables, factor analysis, and hypothesis testing. The book proceeds much more by statements of problems and examples than by the development of mathematical proofs.

The main feature of this book is its pedagogical strength.
While rigor is not sacrificed and no mathematical or statistical rabbits are pulled out of the author's hat, the statistical tools are always presented in terms of the fundamental limitations and criteria of the real world. Almost every concept is introduced by an example set in this world of real problems and difficulties. Mathematical concepts and notational distinctions are most often handled in clearly set off "digressions." The fundamental notions of probability and matrix algebra are reviewed, but in general it is assumed that the student has already been introduced to determinants and matrices and the elementary properties and processes of differentiation. (No more knowledge of mathematics is required than for any of the other comparable texts, and, thanks to the pedagogical skills of the author, probably considerably less.) Frequent emphasis is placed upon computation design and requirements.

Valavanis' book is brilliantly organized for classroom presentation, most of the statistical and mathematical assumptions and concepts being treated verbally and by example before they appear in any mathematical formulation. In addition to the examples used in presentation, there are exercises in almost every chapter.

Seymour E. Harris

Preface

This work is neither a complete nor a systematic treatment of econometrics. It definitely is not empirical. It has one unifying idea: to reduce to common-sense terms the mathematical statistics on which the theory of econometrics rests.

If anything in econometrics (or in any other field) makes sense, one ought to be able to put it into words. The result may not be so compact as a close-knit mathematical exposition, but it can be, in its own way, just as elegant and clear.

Putting symbols and jargon into words understandable to a wider audience is not the only thing I want to do. I think that watering down a highly refined or a very deep mathematical argument is a useful activity.
For instance, if the essence of a problem can be captured by two variables, why tackle n? Or why worry about mathematical continuity, existence, and singularity in a discussion of economic matters, unless these intriguing properties have interesting economic counterparts? We would be misspending effort if all the reader wants is an intelligent layman's idea of what is going on in the field of econometrics. For the sake of the punctilious, I shall give warning every time my heuristic "proof" is not watertight or whenever I slur over an unessential mathematical maze.

Much of econometric literature suffers from overfancy notation. If I judge rightly, many people quake at the sight of every new issue of Econometrica. I hope to show them the intuitive good sense that hides behind the mathematical squiggles.

Besides restoring the self-assurance of the ordinary intelligent reader and helping him discriminate between really important developments in econometric method and mere mathematical quibbles, I have tried to be useful to the teachers of econometrics and peripheral subjects by supplying them with material in "pedagogic" form. And lastly, I should like to amuse and surprise the serious or expert econometrician, the connoisseur, by serving him familiar hash in nice new palatable ways but without loss of nutrient substance.

The gaps in this work are intentional. One cannot, from this book alone, learn econometrics from the ground up; one must pick up elementary statistical notions, algebra, a little calculus, even some econometrics, elsewhere. For the beginner in econometrics, an approximately correct sequence would be the books of Beach (1957), Tinbergen (1951), Klein (1953), and Hood (1953); with Tintner (1952) as a source of examples or museum for the numerous varieties of quantitative techniques in existence.
Tinbergen emphasizes economic policy; Klein, the business cycle and macro-economics; Tintner, the testing of hypotheses and the analysis of time series. All three use interesting empirical examples. For elementary mathematics the first part of Beach (perhaps also Klein, appendix on matrices) is enough.

Reference to all these easily available and digestible texts is meant to avoid my repeating what has been said by others. From time to time, however, I make certain "digressions"; these are held in from the margins. These digressions have to do mostly with mathematical and statistical subjects that in my opinion are either inaccessible or not well explained elsewhere.

Stefan Valavanis

Acknowledgments

Harvard University, for grants from the Joseph H. Clark Bequest and from the Ford Foundation's Small Research Fund in the Department of Economics.
Professor and Mrs. William Jaffé
Professor Arthur S. Goldberger
Professor Richard E. Quandt
Professor Frederick F. Stephan

Contents

Editor's introduction
Editor's preface
Preface
Digressions
Frequent references

Chapter 1. The fundamental proposition of econometrics
  1.1. What econometrics is about
  1.2. Mathematical tools
  1.3. Outline of procedure and main discoveries in the next hundred pages
  1.4. All-importance of statistical assumptions
  1.5. Rationalization of the error term
  1.6. The fundamental proposition
  1.7. Population and sample
  1.8. Parameters and estimates
  1.9. Assumptions about the error term
    1. $u$ is a random real variable
    2. $u_t$, for every $t$, has zero expected value
    3. The variance of $u_t$ is constant over time
    4. The error term is normally distributed
    5. The random terms of different time periods are independent
    6. The error is not correlated with any predetermined variable
  1.10. Mathematical restatement of the Six Simplifying Assumptions
  1.11. Interpretation of additivity
  1.12. Recapitulation
  Further readings

Chapter 2. Estimating criteria and the method of least squares
  2.1. Outline of the chapter
  2.2. Probability and likelihood
  2.3. The concept of likelihood function
  2.4. The form of the likelihood function
  2.5. Justification of the least squares technique
  2.6. Generalized least squares
  2.7. The meaning of unbiasedness
  2.8. Variance of the estimate
  2.9. Estimates of the variance of the estimate
  2.10. Estimates ad nauseam
  2.11. The meaning of consistency
  2.12. The merits of unbiasedness and consistency
  2.13. Other estimating criteria
  2.14. Least squares and the criteria
  2.15. Treatment of heteroskedasticity
  Further readings

Chapter 3. Bias in models of decay
  3.1. Introduction and summary
  3.2. Violation of Simplifying Assumption 6
  3.3. Conjugate samples
  3.4. Source of bias
  3.5. Extent of the bias
  3.6. The nature of initial conditions
  3.7. Unbiased estimation
  Further readings

Chapter 4. Pitfalls of simultaneous interdependence
  4.1. Simultaneous interdependence
  4.2. Exogenous variables
  4.3. Haavelmo's proposition
  4.4. Simultaneous estimation
  4.5. Generalization of the results
  4.6. Bias in the secular consumption function
  Further readings

Chapter 5. Many-equation linear models
  5.1. Outline of the chapter
  5.2. Effort-saving notation
  5.3. The Six Simplifying Assumptions generalized
  5.4. Stochastic independence
  5.5. Interdependence of the estimates
  5.6. Recursive models
  Further readings

Chapter 6. Identification
  6.1. Introduction
  6.2. Completeness and nonsingularity
  6.3. The reduced form
  6.4. Over- and underdeterminacy
  6.5. Bogus structural equations
  6.6. Three definitions of exact identification
  6.7. A priori constraints on the parameters
  6.8. Constraints on parameter estimates
  6.9. Constraints on the stochastic assumptions
  6.10. Identifiable parameters in an underidentified equation
  6.11. Source of ambiguity in overidentified models
  6.12. Identification and the parameter space
  6.13. Over- and underidentification contrasted
  6.14. Confluence
  Further readings

Chapter 7. Instrumental variables
  7.1. Terminology and results
  7.2. The rationale of estimating parametric relationships
  7.3. A single instrumental variable
  7.4. Connection with the reduced form
  7.5. Properties of the instrumental variable technique in the simplest case
  7.6. Extensions
  7.7. How to select instrumental variables

Chapter 8. Limited information
  8.1. Introduction
  8.2. The chain of causation
  8.3. The rationale of limited information
  8.4. Formulas for limited information
  8.5. Connection with the instrumental variable method
  8.6. Connection with indirect least squares
  Further readings

Chapter 9. The family of simultaneous estimating techniques
  9.1. Introduction
  9.2. Theil's method of dual reduced forms
  9.3. Treatment of models that are not exactly identified
  9.4. The "natural state" of an econometric model
  9.5. What are good forecasts?
  Further readings

Chapter 10. Searching for hypotheses and testing them
  10.1. Introduction
  10.2. Discontinuous hypotheses
  10.3. The null hypothesis
  10.4. Examples of rival hypotheses
  10.5. Linear confluence
  10.6. Partial correlation
  10.7. Standardized variables
  10.8. Bunch map analysis
  10.9. Testing for linearity
  10.10. Linear versus ratio models
  10.11. Split sectors versus sector variable
  10.12. How hypotheses are chosen
  Further readings

Chapter 11. Unspecified factors
  11.1. Reasons for unspecified factor analysis
  11.2. A single unspecified variable
  11.3. Several unspecified variables
  11.4. Linear orthogonal factor analysis
  11.5. Testing orthogonality
  11.6. Factor analysis and variance analysis
  Further readings

Chapter 12. Time series
  12.1. Introduction
  12.2. The time interval
  12.3. Treatment of serial correlation
  12.4. Linear systems
  12.5. Fluctuations in trendless time series
  12.6. Correlograms and kindred charts
  12.7. Seasonal variation
  12.8. Removing the trend
  12.9. How not to analyze time series
  12.10. Several variables and time series
  12.11. Time series generated by structural models
  12.12. The over-all autoregression of the economy
  12.13. Leading indicators
  12.14. The diffusion index
  12.15. Abuse of long-term series
  12.16. Abuse of coverage
  12.17. Disagreements between cross-section and time series estimates
  Further readings

Appendix
  A. Layout of computations
     The rules in detail
     Matrix inversion
  B. Stepwise least squares
  C. Subsample variances as estimators
  D. Proof of least squares bias in models of decay
  E. Completeness and stochastic independence
  F. The asterisk notation

Index

Digressions

On the distinction between probability and probability density
On the distinction between "population" and "universe"
On infinite variances
On the univariate normal distribution
On the differences among moment, expectation, and covariance
On the multivariate normal distribution
On computational arrangement
On matrices of moments and their determinants
On notation
On arbitrary weights
On directional least squares
On Jacobians
On the etymology of the term "multicollinearity"
On correlation and kindred concepts
On moving averages and sums

Frequent references

Names in small capital letters refer to the following works:

ALLEN  R. G. D. Allen, Mathematical Economics. New York: St. Martin's Press, Inc., 1956. xvi, 768 pp., illus.
BEACH  Earl F. Beach, Economic Models: An Exposition. New York: John Wiley & Sons, Inc., 1957.
xi, 227 pp., illus.
HOOD  William C. Hood and Tjalling C. Koopmans (eds.), Studies in Econometric Method, Cowles Commission Monograph 14. New York: John Wiley & Sons, Inc., 1953. xix, 323 pp., illus.
KENDALL  Maurice G. Kendall, The Advanced Theory of Statistics, vols. I, II. London: Charles Griffin & Co., Ltd., 1943; 5th ed., 1952. Vol. I, xii, 457 pp., illus.; vol. II, vii, 521 pp., illus.
KLEIN  Lawrence R. Klein, A Textbook of Econometrics. Evanston: Row, Peterson & Company, 1953. ix, 355 pp., illus.
KOOPMANS  Tjalling C. Koopmans (ed.), Statistical Inference in Dynamic Economic Models, Cowles Commission Monograph 10. New York: John Wiley & Sons, Inc., 1950. xiv, 438 pp., illus.
TINBERGEN  Jan Tinbergen, Econometrics, translated from the Dutch by H. Rijken van Olst. New York: McGraw-Hill Book Company, Inc., Blakiston Division, 1951. xii, 258 pp., illus.
TINTNER  Gerhard Tintner, Econometrics. New York: John Wiley & Sons, Inc., 1952. xiii, 370 pp.

CHAPTER 1
The fundamental proposition of econometrics

1.1. What econometrics is about

An econometrician's job is to express economic theories in mathematical terms in order to verify them by statistical methods, and to measure the impact of one economic variable on another so as to be able to predict future events or advise what economic policy should be followed when such and such a result is desired.

This definition describes the major divisions of econometrics, namely, specification, estimation, verification, and prediction.

Specification has to do with expressing an economic theory in mathematical terms. This activity is also called model building. A model is a set of mathematical relations (usually equations) expressing an economic theory. Successful model building requires an artist's touch, a sense of what to leave out if the set is to be kept manageable, elegant, and useful with the raw materials (collected data) that are available.
This book deals only incidentally with the "specification" aspect of econometrics.

The problem of estimation is to use shrewdly our all too scanty data, so as to fill the formal equations that make up the model with numerical values that are good and trustworthy. Suppose we have the following simple theory to quantify: Consumption today ($C$) depends on yesterday's income ($Z$) in such a way that equal increments of income, no matter what income level you start from, always bring equal increments in consumption. Letting $\alpha$ stand for consumption at zero income and $\gamma$ for the marginal propensity to consume, this theory can be expressed thus:

$$C_t = \alpha + \gamma Z_t \tag{1-1}$$

The problem of estimation (and the main concern of this book) is to discover how to use whatever experience we have about consumption $C$ and income $Z$ in order to make a shrewd guess about how large $\alpha$ and $\gamma$ might really be. The problem of estimation is to guess correctly $\alpha$ and $\gamma$, the parameters (or inherent characteristics) of the consumption function.

Point estimation is making the best possible single guess about $\alpha$ and about $\gamma$. Interval estimation is guessing how far our guess of $\alpha$ may be from the true $\alpha$ and our guess of $\gamma$ from the true $\gamma$.

It is not enough, of course, to be able to make correct point and interval estimates. We want to make them as cheaply as possible. We want efficient programming of computations, checks of accuracy, and short cuts. Though this aspect of estimation will not occupy us very much, I shall give some computational advice from time to time.

Verification sets up criteria of success and uses these criteria to accept or reject the economic theory we are testing with the model and our data. It is a tricky subject deeply rooted in the mathematical theory of statistics.
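The difference between a point estimate and an interval estimate for the consumption function (1-1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the "true" values 10.0 and 0.8 are hypothetical, random noise is added to the exact relation so that there is something to estimate, least squares is used as one possible estimating method, and the interval is the familiar normal approximation of plus or minus 1.96 standard errors. The helper name `fit_line` is likewise an invention for this sketch.

```python
import math
import random


def fit_line(z, c):
    """Least squares guesses (a, g) for c = a + g*z, plus the standard error of g."""
    n = len(z)
    z_bar = sum(z) / n
    c_bar = sum(c) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    szc = sum((zi - z_bar) * (ci - c_bar) for zi, ci in zip(z, c))
    g = szc / szz                      # point estimate of gamma
    a = c_bar - g * z_bar              # point estimate of alpha
    resid = [ci - a - g * zi for zi, ci in zip(z, c)]
    s2 = sum(r * r for r in resid) / (n - 2)   # estimated noise variance
    se_g = math.sqrt(s2 / szz)         # standard error of the slope
    return a, g, se_g


random.seed(0)
TRUE_ALPHA, TRUE_GAMMA = 10.0, 0.8     # hypothetical values, unknown in practice

# Simulated "experience": yesterday's income Z and today's consumption C.
income = [random.uniform(50, 150) for _ in range(200)]
consumption = [TRUE_ALPHA + TRUE_GAMMA * z + random.gauss(0, 5) for z in income]

a_hat, g_hat, se_g = fit_line(income, consumption)

# Interval estimate for gamma (normal approximation, an assumption of this sketch).
interval = (g_hat - 1.96 * se_g, g_hat + 1.96 * se_g)

print("point estimates:", round(a_hat, 3), round(g_hat, 3))
print("interval for gamma:", tuple(round(x, 3) for x in interval))
```

The point estimate is a single number for each parameter; the interval says how far that number might plausibly be from the truth, and it narrows as the data become more plentiful or more varied.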
Prediction involves rearranging the model into convenient shape, so that we can feed it information about new developments in exogenous and lagged variables and grind out answers about the impact of these variables on the endogenous variables.

1.2. Mathematical tools

In explaining how to fashion good estimates for the parameters of an econometric model I shall often step into the mathematical statistician's toolroom to bring out one gadget or another required by the next step of our procedure. These digressions are clearly marked so that they can be skipped by those acquainted with the tool in question.

The mathematical tools used again and again are elementary: analytic geometry, which makes equations and graphs interchangeable; probability, a concept enabling us to make precise statements about uncertain events; the derivative (or the operation of differentiating), which is a help in making a "best" guess among all possible guesses; moments, which are a sophisticated way of averaging various magnitudes; and matrices, which are nothing but many-dimensional ordinary numbers (indeed, statements that are true of ordinary numbers seldom fail for matrices; for instance, you can add, subtract, multiply, and divide matrices analogously to numbers and, in general, handle them as if they were ordinary numbers, though perhaps more fragile); a vector is a special kind of matrix.

1.3. Outline of procedure and main discoveries in the next hundred pages

I. We shall deal first with models consisting of a single equation. We shall find that even in this simple case there are important difficulties.
   A. It is not always possible to estimate the parameters of even a single-equation model, for two sorts of reason:
      1. We may lack enough data. This is called the problem of degrees of freedom.
      2. Though the data are plentiful, they may not be rich or varied enough. This is the problem of multicollinearity.
   B. Our second important finding will be that "pedestrian" methods of estimation, for example the least squares fit, are apt to be treacherous. They either give us erroneous impressions about the true values of the parameters or waste the data.
II. Turning then to models containing two or more equations, our main findings will be the following:
   A. It is sometimes impossible to determine the value of each parameter in each equation, but this time not merely for lack of data or their monotony, but rather because the equations look too much like one another to be disentangled. Econometricians call this undesirable property lack of identifiability.
   B. Nonpedestrian, statistically sophisticated methods become very complex and costly to compute when the model increases from a single equation even to two.
   C. Happily, however, by sacrificing some of the rigor of these ideal "equestrian" methods in special, shrewd ways, we can cut the burden of computation by a factor of 5 or 10 and still get pretty good results. Such techniques are called limited information techniques, because they deliberately disregard refinements that should ideally be taken into account. Most theoretical econometricians work in this field, because the need is very great to know not only how to boil down complexity with clever tricks but also precisely how much each trick costs us in accuracy.

1.4. All-importance of statistical assumptions

The key word in estimation is the word stochastic. Its opposite is exact or systematic. Stochastic comes from the Greek stokhos (a target, or bull's-eye). The outcome of throwing darts is a stochastic process, that is to say, fraught with occasional misses. In economics, indeed in all empirical disciplines, we do not expect our predictions to hit the bull's-eye 100 per cent of the time.

Econometrics begins by saying a great deal more about this matter of missing the mark.
Where ordinary economic theory merely recognizes that we miss the mark now and then, econometrics makes statistical assumptions. These are precise statements about the particular way the darts hit the target's rim or hit the wall. Everything (estimation, prediction, and verification) depends vitally on the content of the statistical assumptions.

Econometric models emphasize this fact by using a special variable $u$, called the error term. The error term varies from instance to instance, just as one dart falls above, another below, one to the left, another to the right, of the target. A subscript $t$ serves to indicate the various values of the error term. To make model (1-1) stochastic, we write

$$C_t = \alpha + \gamma Z_t + u_t \tag{1-2}$$

Before going on to rationalize the presence of the error term $u$ in equation (1-2), two things must be explained. First, $u$ could have been included as a multiplicative factor or as an exponential rather than as an additive term. Second, its subscript $t$ need not express time. It can refer just as well to various countries or income classes. To facilitate the exposition, I shall henceforth treat $u_t$ as an additive term and take $t$ to represent time. Exceptions will be clearly labeled. The common-sense interpretation of additivity is deferred to Sec. 1.11.

1.5. Rationalization of the error term

There are four types of reasons why an econometric model should be stochastic and not exact: incomplete theory, imperfect specification, aggregation of data, and errors of measurement. Not all of them apply to every model.

1. Incomplete theory
A theory is necessarily incomplete, an abstraction that cannot explain everything. For instance, in our simple theory of consumption:
   a. We have left out possible variables, like wealth and liquid assets, that also affect consumption.
   b. We have left out equations. The economy is much more complex than a single equation, no matter how many explanatory variables this single equation may contain; there may be other links between consumption and income besides the consumption function. [1]
   c. Human behavior is "ultimately" random.

   [1] How many independent links there may be and how we are to find them is itself a problem in statistical inference, and is treated briefly in Chap. 10.

2. Imperfect specification
We have linearized a possibly nonlinear relationship.

3. Aggregation of data
We have aggregated over dissimilar individuals. Even if each of them possessed his own $\alpha$ and $\gamma$ and if his consumption reacted in exact (nonstochastic) fashion to his past income, total consumption would not be likely to react exactly in response to a given total income, because its distribution may change. Another way of putting this is: Variables expressing individual peculiarities are missing (cf. 1a). Or this way: Equations that describe income distribution are missing (cf. 1b).

4. Errors of measurement
Even if behavior were exact, survey methods are not, and our statistical series for consumption and income contain some errors of measurement. Throughout this book we pretend that all variables are measured without error.

1.6. The fundamental proposition

All we get out of an econometric model is already implied in:

1. Its specification; that is to say, consumption $C$ depends on yesterday's income $Z$ as in the equation $C_t = \alpha + \gamma Z_t + u_t$ and in no other way [1]
2. Our assumptions concerning $u$, that is to say, the particular way we suppose the relationship between $C$ and $Z$ to be inexact [2]
3. Our sampling procedure, namely, the way we arrange to get data
4. Our sample, i.e., the particular data that happen to turn up after we decide how to look for them
5. Our estimating criterion, i.e., what properties we desire our estimates of $\alpha$ and $\gamma$ to have, short of the unattainable: absolute correctness

   [1] This is also called "structural specification."
   [2] This is also called "stochastic specification."

Over items 1, 2, 3, and 5 we have absolute control; for we are free to change our theory of consumption, our set of assumptions concerning the error term, our data-collecting techniques, and our estimating criterion. We have no control over item 4; for what data actually turn up is a matter of luck.

According to this fundamental proposition, what estimates we get for the parameters ($\alpha$ and $\gamma$) depends, among other things, on the stochastic assumptions, i.e., what we choose to suppose about the behavior of the error term $u$. Every set of assumptions about the error term prescribes a certain way of guessing at the true value of the parameters. And conversely, every guess about the parameters is implicitly a set of stochastic assumptions.

The relationship between stochastic assumptions and parameter estimates is not always a one-to-one relationship. A given set of stochastic assumptions is compatible with several sets of different parameter estimates, and conversely. In practice, we don't have to worry about these possibilities, because we shall be making assumptions about $u$ that lead to unique guesses about $\alpha$ and $\gamma$, or, at the very worst, to a few different guesses. Also, in practice, since we usually are interested in $\alpha$ and $\gamma$, not in verifying our assumptions about $u$, it does not matter that many different $u$ assumptions are compatible with a single set of parameter estimates.

1.7. Population and sample

The whole of statistics rests on the distinction between population and observation or sample. People were receiving income and consuming it long before econometrics and statistics were dreamed of.
There is, so to speak, an underlying population of $C$'s and $Z$'s, which we can enumerate, hypothetically, as follows:

$$C_1, C_2, \ldots, C_p, \ldots, C_P \quad \text{or} \quad C_p \ \text{for } p = 1, 2, \ldots, P$$
$$Z_1, Z_2, \ldots, Z_p, \ldots, Z_P \quad \text{or} \quad Z_p \ \text{for } p = 1, 2, \ldots, P$$

Of these $C$'s and $Z$'s we may have observed all or some. Those that we have observed take on a different index $s$ instead of $p$, to emphasize that they form a subset of the population. All we observe, then, is $C_s$ and $Z_s$, where the $s$ assumes some (perhaps all) of the values that $p$ runs, but no more. Index $s$ can start running anywhere, say, at $p = 5$, assume the values 6 and 7, skip 8, 25, and 92, and stop anywhere short of the value $P$ or at $P$ itself.

In all cases that I shall discuss, the sample covers consecutive time periods, which are renumbered, for convenience, in such a way that the beginning of time coincides with the beginning of the sample, not of the population. Whether the sample is consecutive or not sometimes does and sometimes does not affect the estimation of $\alpha$ and $\gamma$.

Note that we mean by the term sample a given collection of observations, like $S = (C_9, C_{10}, C_{11};\ Z_9, Z_{10}, Z_{11})$, not an isolated observation. $S$ is a sample of three observations, the following ones: $(C_9, Z_9)$, $(C_{10}, Z_{10})$, and $(C_{11}, Z_{11})$. Samples made up of a single observation can exist, of course, but we seldom work with so little observation.

1.8. Parameters and estimates

Another crucial distinction is between a parameter and its estimate. If the theory is correct, there are, hiding somewhere, the true $\alpha$ and $\gamma$. These we never observe. What we do do is guess at them, basing ourselves on such evidence and common sense as we may have. The guesses, or estimates, always wear a badge to distinguish them from the parameters themselves.
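The two distinctions drawn so far (population versus sample, parameter versus estimate) can be sketched together in Python. Everything numerical here is hypothetical: the population size, the "true" parameter values, the noise, and the choice of least squares as the way of guessing. The helper names `slope` and `draw_sample` are inventions for this illustration.

```python
import random

random.seed(1)
ALPHA, GAMMA = 10.0, 0.8    # hypothetical "true" parameters, never observed in practice
P = 40                      # hypothetical population size

# Population: one pair (C_p, Z_p) for each p = 1, ..., P.
population = {}
for p in range(1, P + 1):
    z = 100.0 + 2.0 * p
    population[p] = (ALPHA + GAMMA * z + random.gauss(0, 3), z)


def slope(pairs):
    """Least squares slope of C on Z: one possible way of guessing gamma."""
    n = len(pairs)
    c_bar = sum(c for c, _ in pairs) / n
    z_bar = sum(z for _, z in pairs) / n
    num = sum((z - z_bar) * (c - c_bar) for c, z in pairs)
    den = sum((z - z_bar) ** 2 for _, z in pairs)
    return num / den


def draw_sample(first, last):
    """Consecutive sample s = first, ..., last, renumbered t = 1, 2, ..."""
    return {t: population[s] for t, s in enumerate(range(first, last + 1), start=1)}


sample_a = draw_sample(9, 11)     # the observations (C_9, Z_9), (C_10, Z_10), (C_11, Z_11)
sample_b = draw_sample(20, 35)    # a different, longer stretch of the same population

# The parameter GAMMA is one fixed number; each sample yields its own estimate.
print("estimate from sample a:", slope(list(sample_a.values())))
print("estimate from sample b:", slope(list(sample_b.values())))
```

Renumbering with `enumerate(..., start=1)` mirrors the convention that time begins with the sample rather than with the population.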
CONVENTION

We shall use three kinds of badge for a parameter estimate: a roof-shaped hat, as in $\hat\alpha$, $\hat\gamma$, to mark maximum likelihood estimates; a bird, the same symbol upside down, as in $\check\alpha$, $\check\gamma$, for naive least squares estimates; and the wiggle, as in $\tilde\alpha$, $\tilde\gamma$, for other kinds of estimates or for estimates whose kind we do not wish to specify. These types of estimate are defined in Chap. 2.

The distinction between error and residual is analogous to the distinction between parameter and estimate. The error $u_t$ is never observed, although we may speculate about its behavior. It always goes with the real $\alpha$ and $\gamma$, as in (1-2), whereas the residual, which is an estimate of the error and whose symbol always wears a distinctive badge, can be calculated, provided we have settled on a particular guess $(\hat\alpha, \hat\gamma)$ or $(\check\alpha, \check\gamma)$ for the parameters. The value of the error does not depend on our guessing; it is just there, in the population and, therefore, in the sample. The residual, however, depends on the particular guess. To emphasize this fact we put the same badge on the residual as on the corresponding parameter estimate. We write, for example,

$C_t = \hat\alpha + \hat\gamma Z_t + \hat u_t$ or $C_t = \check\alpha + \check\gamma Z_t + \check u_t$.

Now we can state precisely what the problem of estimation is, as follows. We assume a theory, for example, the theory of consumption $C_t = \alpha + \gamma Z_t + u_t$; we assume that $u_t$ behaves in some particular way (to which the next section is devoted); we get a set of observations on $C$ and $Z$ (the sample). Then we manipulate the sample data to give us estimates $\tilde\alpha$ and $\tilde\gamma$ that satisfy our estimating criterion (discussed in Chap. 2). Then we compute, if we wish, the residuals $\tilde u_t$ as estimates of the errors $u_t$.

1.9. Assumptions about the error term

Besides additivity (Sec. 1.4) we shall now make and interpret six assumptions about the error term. Of these, the first is indispensable, exactly as it stands. The other five could be different in content or in number.
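The point of Sec. 1.8, that residuals depend on the guess while the errors do not, can be sketched numerically. The following is a minimal illustration only: the three $(C, Z)$ pairs and both parameter guesses are invented, and the function name `residuals` is hypothetical.

```python
# Hypothetical sample of (C, Z) pairs; the figures are invented for illustration.
sample = [(86.0, 100.0), (99.0, 120.0), (118.0, 140.0)]

def residuals(alpha_guess, gamma_guess, data):
    """Return C_t - (alpha_guess + gamma_guess * Z_t) for every observation."""
    return [c - (alpha_guess + gamma_guess * z) for c, z in data]

# Two different guesses at (alpha, gamma) applied to the same sample.
first_guess = residuals(10.0, 0.75, sample)   # [1.0, -1.0, 3.0]
second_guess = residuals(20.0, 0.5, sample)   # [16.0, 19.0, 28.0]

print(first_guess)
print(second_guess)
```

The sample is identical in both calls; only the guess changes, and the residuals change with it. The true errors $u_t$ never appear in the computation, because they are never observed.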
Note carefully that these are statements about the $u$'s, not the $\hat u$'s.

Assumption 1. $u$ is a random real variable.

If the model is stochastic, either its systematic variables (consumption or income) are measured with errors, or the consumption function itself is subject to random disturbances, or both. Since we have ruled out (Sec. 1.5) errors of measurement, the relationship itself has to be stochastic.

"Random" is one of those words whose meaning everybody knows but few can define. Unpardonably, few standard texts give its definition. A variable is random if it takes on a number of different values, each with a certain probability. Its different values can be infinite in number and can range all over the field, provided there are at least two of them. For instance, a variable $w$ that is equal to $-\tfrac{1}{2}$ twenty-five per cent of the time, to $3 + \sqrt{2}$ forty per cent of the time, and to $+35.3$ thirty-five per cent of the time is a random variable. We may or may not know what values it takes on or their probabilities. Its probability distribution may or may not be expressible analytically. (See Sec. 2.2.)

Digression on the distinction between probability and probability density

A random variable $w$ can be discrete, like the number of remaining teeth in members of a group of people, or continuous, like their weight. If $w$ takes on a finite number of values, their probabilities can be quite simply plotted on and read off a dot diagram or point graph (Fig. 1a). With a continuous variable we can usually speak only of the probability that its value should lie between (or at) such and such limits. In this case, we plot a probability-density graph (Fig. 1b). The height of such a graph at a point is the probability density, and the relative area under the graph between any two points of the $w$ axis is the probability.

Assumption 2. $u_t$, for every $t$, has zero expected value.
Naively interpreted, this proposition says that the "average" value of $u_1$ is zero, that of $u_2$ is also zero, and so forth. Or, to put it differently, it says that a prediction like $C_1 = \alpha + \gamma Z_1$ is "on the average" correct, that the same is true of $C_2 = \alpha + \gamma Z_2$, and so forth.

Fig. 1. A random variable. a. Variable $w$ is discrete. The illustration is a dot diagram, or point graph. b. Variable $w$ is continuous. The illustration is a probability-density graph.

The trouble begins when you begin to wonder what "on the average" can possibly mean if you stick to a single instance, like time period 1 (or time period 2). For every event happens in a particular way and not otherwise. Suppose, for instance, that in the year 1776 ($t = 1776$) consumption fell short of its theoretical level $\alpha + \gamma Z_{1776}$ by 2.24 million dollars, that is, that $u_{1776} = -2.24$. Obviously, then, the average value of $u_{1776}$ is exactly $-2.24$. What could we possibly wish to convey by the statement that $u_{1776}$ (and every other $u_t$) has zero expected value?

One should never identify the concept of expected value with the concept of the arithmetic mean. Arithmetic mean denotes the sum of a set of $N$ numbers divided by $N$ and is an algebraic concept. Expected value is a statistical or combinatorial concept. You have to imagine an urn containing a (finite or infinite) number of balls, each with a number written on it. Consider now all the possible ways one could draw one ball from such an urn. The arithmetic mean of the numbers that would turn up if we exhausted all possible ways of drawing one ball is the expected value. The random term of an econometric model is assumed to come from an Urn of Nature which, at every moment of time, contains balls with numbers that add up to zero.
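The contrast between expected value and arithmetic mean can be made concrete with a small sketch. The urn below is purely hypothetical: its three ball values and their shares are invented so that the expected value is zero while the single ball actually drawn in "1776" reads $-2.24$. The function names are likewise illustrative.

```python
from fractions import Fraction

# Hypothetical Urn of Nature for one period: number on the ball -> share of balls.
urn = {
    Fraction(-224, 100): Fraction(1, 4),   # -2.24 on a quarter of the balls
    Fraction(0): Fraction(1, 4),           #  0    on a quarter of the balls
    Fraction(112, 100): Fraction(1, 2),    # +1.12 on half of the balls
}

def expected_value(distribution):
    """Average over every possible way of drawing one ball from the urn."""
    return sum(value * share for value, share in distribution.items())

def arithmetic_mean(numbers):
    """Sum of the N numbers actually in hand, divided by N."""
    return sum(numbers) / len(numbers)

realized = [Fraction(-224, 100)]   # the single draw that actually happened

print(expected_value(urn))         # property of the urn
print(arithmetic_mean(realized))   # property of the realized draw
```

The first figure belongs to the urn and is zero by construction; the second belongs to the one realized draw and equals that draw. Exact fractions are used only to keep the arithmetic free of rounding.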
The common-sense interpretation of Nature's Urn is as follows: Though in 1776 actual consumption in fact fell short of the theoretical by 2.24 and no other amount, the many causes that interacted to produce $u_{1776} = -2.24$ could have interacted (in 1776) in various other ways. This, theoretically, they were free to do since they were random causes. Now, try to think of all conceivable combinations of these causes, or if you prefer, think of very many 1776 worlds, identical in all respects except in the combinations of random causes that generated the random term. Let us have as many such worlds as there are theoretical combinations of the causes behind the random term. In some worlds the causes act overwhelmingly to make consumption lower than the nonstochastic level $\alpha + \gamma Z_{1776}$; in other worlds the causes act so as to make it greater than $\alpha + \gamma Z_{1776}$; and in a few worlds the causes cancel out, so that $C_{1776} = \alpha + \gamma Z_{1776}$ exactly. Now consider the random terms of all possible worlds, and (says the assumption) they will average out to zero.

This interpretation is a conceptual model we can never hope to prove or disprove. Its chief merit is that it reduces chance and statistics to the (relatively) easy language and theorems of combinatorial algebra. Some people take it seriously; others (myself included) use it for lack of anything better.

Digression on the distinction between "population" and "universe"

Whether or not we take Nature's Urn seriously, we will be well advised to acknowledge that we are dealing with three levels of discourse, not just the two that I called population and sample. The third and deeper level is called the universe. It contains all events as they have happened and as they might have happened if everything else had remained the same but the random shocks.

Level I  Sample: things that both happened and were observed. It is drawn from
Level II  Population: things that happened but were not necessarily observed.
It is drawn from
Level III  Universe: all things that could have happened. (In the nature of things only a few did.)

CONVENTION

We shall henceforth use four types of running subscript:

$s = 1, 2, \ldots, S$ for the sample
$p = 1, 2, \ldots, P$ for the population
$i = 1, 2, \ldots, I$ for the universe
$t = 1, 2, \ldots, T$ for instances in general, whether they come from the sample, the population, or the universe

In a sense the population $(C_p, Z_p)$ of consumption and income as they actually happened in recorded and unrecorded history is merely a sample from the universe $(C_i, Z_i)$ of all possible worlds. Naturally, what we call the sample is drawn from the population of actual events, not from the hypothetical universe of level III. In most instances it does no harm to speak (and prove theorems) as if level I were picked directly from level III, not from level II.

The Platonic universe of level III is indeed rather unseemly for the field of statistics (which is surely, in lay opinion, the most hard-boiled of mathematics, resting solidly on "facts") and has been amusingly ridiculed by Hogben.1 The next few paragraphs state why the abstract model of Nature's Urn is a less appropriate foundation for econometrics than for statistical physics or biology. But the rest of the book goes merrily on using the said Urn.

Economic and physical phenomena alike take place in time. In both fields, the statement that $u_t$ is a random variable for each $t$ is inevitably an abstraction, because time runs on inexorably.

1 Lancelot Hogben, F.R.S., Statistical Theory: The Relationship of Probability, Credibility, and Error; An Examination of the Contemporary Crisis in Statistical Theory from a Behaviourist Viewpoint, pp. 98-105 (London: George Allen & Unwin Ltd., 1957).
In the physical sciences events are deemed "repeatable," or arising from a universe "fixed in repeated samples," primarily because the experimenter can ideally replicate exactly all systematically significant conditions that had surrounded his original event. This is not possible in social phenomena of the "irreversible" or "progressive" type. Although in the physical sciences it may be safe to neglect the difference between population and universe, it is unsafe in econometrics. For, as economic phenomena take place in time, all other conditions, including the exogenous variables, move on to new levels, often never to return. The common-sense phrase "on the average over similar experiments" makes much more sense in a laboratory science than in economics.

Nature's Urn also supports maximum likelihood, variance of an estimate, bias, consistency, and many other notions we shall have occasion to introduce in later chapters. All these rest on the notion of "all conceivable samples." The class of all conceivable samples includes first of all samples of all conceivable sizes; it also includes all conceivable samples of a given size, say, 4. A sample of size 4 may consist of points that actually happened (if so, they are in the population); it also could consist (partly or entirely) of points in the universe but not in the population. The latter kind of sample is easy to conceive but impossible to draw, because the imagined points never "happened." Therefore, even a complete census of what happened is not enough for constructing an exhaustive list of all conceivable samples.

Assumption 3. The variance of $u_t$ is constant over time.

This means merely that, in each year, $u_t$ is drawn from an identical Urn, or universe. This assumption states that the causes underlying the random term remain unchanged in number, relative importance, and absolute impact, although, in any particular year, one or another of them may fail to operate.
For simplicity's sake we have assumed no errors of measurement. In fact there may be some, and their typical size could vary systematically with time (or with the independent variable $Z$). If we try to measure the diameter of a distant star, our error of measurement is likely to be several million miles; when we measure the diameter of Sputnik, it can be only a few feet. Likewise, if our data stretch from 1850 to 1950, national income increases by a factor of 20. It is quite likely that errors of measurement, too, should increase absolutely. If they do, Assumption 3 is violated, and some of the techniques that I develop below should not be used.

Digression on infinite variances

The variance of $u$ is not only constant but finite. When $u$ is normal it is unnecessary to stipulate that its variance is finite, because all nondegenerate normal distributions have finite variance. There exist, however, nondegenerate distributions with zero mean and infinite variance, for example, the discrete random variable

$\ldots, -16, -8, -4, 4, 8, 16, \ldots$

with probabilities, respectively,

$\ldots, \tfrac{1}{16}, \tfrac{1}{8}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{16}, \ldots$

The central limit theorem, according to which the sum of $N$ random variables (distributed in any form) approaches the normal distribution for large $N$, is valid only if the original distributions have finite variances.

Assumption 4. The error term is normally distributed.

This is a rather strong restriction. We impose it mainly because normal distributions are easy to work with.

Digression on the univariate normal distribution

The single-variable normal distribution is shaped like a symmetrical two-dimensional bell whose mouth is wider than the mouth of anything you might name. Normal distributions come tall, medium, and squat (i.e., with small, medium, or large variances). And the top of the bell can be over any point of the $w$ axis; that is to say, the mean of the normal can be negative, zero, or positive, large or small. This distribution's chief characteristic is that extreme values are more and more unlikely the more extreme they get. For instance, the likelihood that all the people christened John and living in London will die today is extremely small, and the likelihood that none of them will die today is equally small.

Now why is this? Because London is not under atomic attack, the Johns are not all aboard a single bus, not all of them are diving from the London Bridge, nor were they all born 85 years ago. Each goes about his business more or less independently of the others (except, perhaps, father-and-son teams of Johns), some old, some young, some exposing themselves to danger and others not.

The reason why the probability that $w$ of these Johns will die today approximates the normal is that there are very many of them and that each is subjected to a vast number of independent influences, like age, food, heredity, job, and so forth. This probability would not be normal if the Johns were really few, if the causes working toward their deaths were few, or if such causes were many but linked with one another.

The assumption that $u$ is normal is justified if we can show that the variables left out of equation (1-2) are infinitely numerous and not interlinked. If they are merely very many and not interlinked, then $u$ is approximately normal. If they are infinitely many but enough of them are interlinked, then $u$ is not even approximately normal. We often know or suspect that these variables, such as wealth, liquid balances, age, residence, and so forth, are quite interlinked and are very likely to be present together or absent together.
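The claim that many independent, non-interlinked causes produce an approximately normal total can be checked in the simplest case, a count of independent yes-or-no events. The sketch below is illustrative only: the figures (400 independent individuals, each with the same probability 0.1 of the event) are invented and far from realistic mortality, and the function names are hypothetical. It compares the exact binomial probabilities with the normal density of the same mean and variance.

```python
import math

def binomial_pmf(n, p, k):
    """Exact probability that k of n independent events occur."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_density(x, mean, variance):
    """Height of the normal curve with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Hypothetical figures: 400 independent individuals, each with probability 0.1.
n, p = 400, 0.1
mean, variance = n * p, n * p * (1 - p)

# Largest gap between the exact probabilities and the normal approximation.
max_gap = max(
    abs(binomial_pmf(n, p, k) - normal_density(k, mean, variance))
    for k in range(n + 1)
)
print(max_gap)
```

The largest gap is small next to the peak probability of roughly 0.066. Repeating the comparison with a much smaller `n` widens the gap, in line with the remark that the approximation fails when the Johns are few. The sketch does not cover interlinked causes, which would call for a different probability model.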
Sometimes the following argument is advanced: In our model of consumption, the error term $u$ stems from many sources; that is, we have left out variables, we have left out equations, we have linearized, we have aggregated, and so on. These are all different operations, presumably not linked with one another. Therefore, $u$ is normally distributed.

This argument is, of course, a bad heuristic argument, and it does not even stand for an existing (but difficult) rigorous argument. It is logically untidy to count as arguments for the normality of $u$, on one and the same level of discourse, such diverse items as the fact of linearization and the number of unspecified variables that affect consumption.

The assumption stands or falls on the argument of many non-interlinked absent variables. Most alternative assumptions cause great computational grief.

Assumption 5. The random terms of different time periods are independent.

This assumption requires that in each period the causes that determine the random term act independently of their behavior in all previous and subsequent periods. It is easy to violate this assumption.

1. The error term includes variables that act cyclically. If, for example, we think consumption has a bulge every 3 years because that is how often we get substantially remodeled cars, this effect should be introduced as a separate variable and not included in $u$.

2. The model is subject to cobweb phenomena. Suppose that consumers in year 1 (for any reason) underestimate their income, so that they consume less than the theoretical amount. Then in year 2 they discover the error and make it up by consuming more than the theoretical amount of year 2; and so on.

3. One of the causes behind the random term may be an employee's vacation, which is usually in force for 2 weeks though the model's unit period is 1 week.
Any such behavior violates the requirement that the error of any period be independent of the error in all previous periods.

Assumption 6. The error is not correlated with any predetermined variable.

To appreciate this assumption, suppose that (for whatever reason) sellers set today's price $p_t$ on the basis of the change in the quantity sold yesterday over the day before; that is,

$p_t = \alpha + \gamma(q_{t-1} - q_{t-2}) + u_t$

Suppose, further, that the greater (and more evident) the change in $q$ the more they strive to set a price according to the above rule. Such behavior violates Assumption 6.

We can think of examples where behavior is fairly exact (small $u$'s) for moderate values of the independent variable, but quite erratic for very large or very small values of the independent variable. I am apt, for instance, to stumble more if there is either too little or too much light in the room. This, again, violates Assumption 6, because the error in the stochastic equation describing my motion depends on the intensity of light.

So we come to the end of our statistical assumptions about the error $u$. When in future discussion I speak of $u$ as having "all the Simplifying Properties" (or as "satisfying all the Simplifying Assumptions"), I mean exactly these six.

Certain of these six assumptions can be checked or statistically verified from a sample; others cannot. I shall return to this topic later. Of these assumptions only Assumption 1 is obligatory. There are decent estimating procedures for other sets of assumptions.

1.10. Mathematical restatement of the Six Simplifying Assumptions

1. $u_t$ is random for every $t$: Some $p(u)$ is defined for all $u$ such that $0 \le p \le 1$ and $\int p(u)\,du = 1$

2. The expected value of $u_t$ is zero: $\mathcal{E}u_t = 0$ for all $t$

3. The variance $\sigma_{uu}(t)$ is constant in time, and finite: $0 < \sigma_{uu}(t) = \operatorname{cov}(u_t, u_t) = \sigma_{uu} < \infty$ for all $t$

4.
$u_t$ is normal:

$$p(u) = (2\pi)^{-1/2} \det(\sigma_{uu})^{-1/2} \exp\left[-\tfrac{1}{2}(u - \mathcal{E}u)(\sigma_{uu})^{-1}(u - \mathcal{E}u)\right]$$

I explain this fancy notation in the next chapter. I use it because it generalizes very handily into many dimensions. The usual way to write the normal distribution is

$$p(u) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(u - \mathcal{E}u)^2}{2\sigma^2}\right]$$

The symbol $\sigma_{uu}$ is the square of $\sigma$; $\sigma_{uu}$ is the variance of $u$.

5. $u$ is not autocorrelated: $\mathcal{E}(u_t u_{t-\theta}) = 0$ for all $t$ and for $\theta \ne 0$

6. $u$ is fully independent of the variable $Z$: $\operatorname{cov}(u_t, Z_{t-\theta}) = 0$ for all $t$ and all $\theta$

1.11. Interpretation of additivity

The random term $u$ appears in model (1-2) as an additive term. This fact rules out interaction effects between $u$ and $Z$. Absence of interaction effects means that, no matter what the level of income $Z$ may be, a random term of a given magnitude always has the same effect on consumption. Its impact does not depend on the level of income.

1.12. Recapitulation

We must be very clear in econometrics, as well as in other areas of statistical inference, about what is assumed, what is observed, and what is guessed, and also about what criterion the guess should satisfy. Table 1.1 provides a check list of things we accept by assumption, things we can and cannot do, and things we must do in making statistical estimates. The items in the first three columns have been introduced in this chapter; the estimating criteria in the fourth column will be discussed in the next chapter.

Table 1.1

| These things are ASSUMED | These things are OBSERVED | These things ARE NOT OBSERVED | This is IMPOSED | These things are COMPUTED by us |
|---|---|---|---|---|
| That a true $\alpha$ and a true $\gamma$ exist | | The true $\alpha$ and $\gamma$ | Some estimating criterion for computing $\tilde\alpha$, $\tilde\gamma$, $\tilde u$ | $\tilde\alpha$, a guess as to $\alpha$; $\tilde\gamma$, a guess as to $\gamma$ |
| That a true $u_t$ exists in each time period | | The true $u_t$ | | The residuals $\tilde u_s$ $(s = 1, 2, \ldots, S)$ |
| That $u_t$ has the Six Simplifying Properties | | $\mathcal{E}u$, the expected value; $\sigma_{uu}$, the variance of the error | | $(\sum \tilde u_s)/S$, the mean of the residuals; $m_{\tilde u \tilde u}$, the moment of the residuals |
| That there is a universe $C_i$, $Z_i$ $(i = 1, 2, \ldots, I)$ in which $C_i = \alpha + \gamma Z_i + u_i$ | The $C$'s and $Z$'s of the sample, denoted by $C_s$, $Z_s$ $(s = 1, 2, \ldots, S)$ | The $C$'s and $Z$'s not in the sample | | |

Digression on the differences among moment, expectation, and covariance

Consider two variables, consumption $c$ and yesterday's income $z$. They may or may not be functionally related. They have a universe $U = \{c_1, \ldots, c_I;\ z_1, \ldots, z_I\}$.

Expectation

The average value of $c$ in the universe is symbolized by $\mathcal{E}c$ (read "expected value of $c$" or "expectation of $c$"). Similarly, $\mathcal{E}z$ is the average $z$ in the universe.

Covariance

The covariance of $c$ and $z$ is defined as the expected value

$\mathcal{E}(c_i - \mathcal{E}c)(z_i - \mathcal{E}z)$

where $i$ runs over the entire universe. This is symbolized by $\operatorname{cov}(c, z)$ or $\sigma_{cz}$.

Variance

The variance of $c$ is simply the covariance of $c$ and $c$. It is written $\operatorname{var} c$, or $\operatorname{cov}(c, c)$, or $\sigma_{cc}$.

Now consider a specific sample $S^0$ made up of specific (corresponding) pairs of consumption and income, for instance,

$[c_{27}, c_{54}, c_{105};\ z_{27}, z_{54}, z_{105}]$

Let the sample means for this particular sample be written $c^0$ and $z^0$, respectively.

Moment

The moment (for sample $S^0$) of $c$ on $z$ is defined as the expected value

$\mathcal{E}(c_s - c^0)(z_s - z^0)$

where $s$ runs over 27, 54, and 105 only. It is symbolized by $m_{c \cdot z}(S^0)$ or $m_{c \cdot z}$ or simply $m_{cz}$. Of course, a different sample $S^1$ would give a different moment $m_{c \cdot z}(S^1)$.

Expectation of a moment

Now consider all samples of size 3 that we can draw (with replacement) from the universe $U$. Then the expectation of $m_{cz}$ is the average of the various moments $m_{c \cdot z}(S^0)$, $m_{c \cdot z}(S^1)$, etc., when all conceivable samples of size 3 are taken into account. A universe with $I$ elements generates a finite number of such samples, and the means $c$ and $z$ of the two variables vary from sample to sample. The expectation of $m_{cz}$ for samples of size 4 is, in general, a different value altogether.

Much confusion will be avoided later on if these distinctions are kept in mind. Clear up any questions by doing the exercises below.

Exercises

1.A If $c$, $c'$, $c''$, $c^*$ are four independent drawings (with replacement) from the universe $U$, prove that $\mathcal{E}(c' + c'' + c^*) = 3\,\mathcal{E}c$.

1.B If $c$, $z$, $q$ are variables and $k$ is a constant, which of the following relations are identities?

$\operatorname{cov}(c, z) = \operatorname{cov}(z, c)$
$m_{cz} = m_{zc}$
$\mathcal{E}(c + z) = \mathcal{E}c + \mathcal{E}z$
$\operatorname{cov}(kc, z) = k \operatorname{cov}(c, z)$
$\operatorname{var}(kc) = k^2 \operatorname{var} c$
$m_{(kc) \cdot z} = k\, m_{cz}$
$\operatorname{cov}(c + q, z) = \operatorname{cov}(c, z) + \operatorname{cov}(q, z)$

Further readings

The art of model specification is learned by practice and by studying cleverly contrived models. Beach,1 Klein, Tinbergen, and Tintner give several examples. L. R. Klein and A. S. Goldberger, An Econometric Model of the United States: 1929-1952 (Amsterdam: North-Holland Publishing Company, 1957), present a celebrated large-scale econometric model. Chapters 1 and 2 give a good idea of the difficulties of estimation. The performance of this model is appraised by Karl A. Fox in "Econometric Models of the United States" (Journal of Political Economy, vol. 64, no. 2, pp. 128-143, April, 1956) and by Carl Christ in "Aggregate Econometric Models" (American Economic Review, vol. 46, no. 3, pp. 385-408, June, 1956). "The Dynamics of the Onion Market," by Daniel B. Suits and Susumu Koizumi (Journal of Farm Economics, vol. 38, no. 2, pp. 475-484, May, 1956), is an interesting example of econometrics applied to a particular market in the short run.

Kendall, chap. 7, reviews the logic of probability, sampling, and expected value. For a lucid discussion of the concept of randomness, see M. G.
Kendall, "A Theory of Randomness" (Biometrika, vol. 32, pt. 1, pp. 1-15, January, 1941).

As far as I know, the assumptions about the random term have not been discussed systematically from the economic point of view, except for Marschak's brief passage (pp. 12-15) in Hood, chap. 1, sec. 7. See also Gerhard Tintner, The Variate Difference Method (Cowles Commission Monograph 5, pp. 4-5 and appendixes VI, VII, Bloomington, Indiana: Principia Press, 1940), and Tintner, "A Note on Economic Aspects of the Theory of Errors in Time Series" (Quarterly Journal of Economics, vol. 53, no. 1, pp. 141-149, November, 1938).

As defined in economics textbooks, the production function and the cost function necessarily violate Assumption 2. In no instance (whether in the universe, the population, or the sample) can the random disturbance exceed zero in the production function and fall short of zero in the average cost function. All statistical studies of production and cost functions I know of have implicitly used the assumption that $\mathcal{E}u = 0$. The error is in the assumption of normality. See, for instance, Joel Dean, "Department Store Cost Functions," in Studies in Mathematical Economics and Econometrics, in memory of Henry Schultz, edited by Oscar Lange, Francis McIntyre, and Theodore O. Yntema (p. 222, Chicago: University of Chicago Press, 1942), which is also an interesting attempt to fit static cost functions to data from years of large dynamic changes. In this respect I was guilty myself in "An Econometric Model of Growth: U.S.A. 1869-1953" (American Economic Review, vol. 45, no. 2, pp. 208-221, May, 1955).

For examples of nonadditive disturbances, see Hurwicz, "Systems with Nonadditive Disturbances," chap. 18 of Koopmans, pp. 410-418.

1 See Frequent References at front of book. Works of authors whose names are capitalized are listed there.

CHAPTER 2

Estimating criteria and the method of least squares

2.1.
Outline of the chapter

This chapter, like the previous one, deals exclusively with single-equation models. Unless the contrary is stated, all the Simplifying Assumptions of Sec. 1.9 remain in force.

The main points of this chapter are the following:

1. Once we have specified the model and made certain stochastic assumptions, our sample tells us nothing about the unknown parameters of the model unless we adopt an estimating criterion.

2. A very reasonable (and hard to replace) criterion is maximum likelihood. It is based on the assumption that, while we were taking the sample, Nature performed for our benefit the most likely thing, or generated for us her most probable sample.

3. Once the maximum likelihood criterion is adopted, we can tell precisely what the unknown parameters must be if our sample was the most likely to turn up. This is what is called maximizing the likelihood function. We find the unknowns by manipulating this function.

4. The familiar least squares fit arises as a special case of the operation of maximizing the likelihood function.

5. In many cases, adopting the maximum likelihood criterion automatically generates estimates that conform to other estimating criteria, for example unbiasedness, consistency, efficiency.

6. If estimates of the unknown parameters are unbiased, consistent, etc., this does not mean that our particular sample or method has given us a correct estimate. It means that, if we had infinite facilities (or infinite patience), we could get a correct estimate "in the long run" or "on the average."

7. The likelihood function not only tells us what values of the parameters give the greatest probability to the observed event but also attaches to such values degrees of credence, or reliability.

Though these statements can be made about all sorts of models, the single-equation model of consumption that I have been using all along captures the spirit of the procedure.
Multi-equation models have all the complications of single-equation models plus many others.

2.2. Probability and likelihood

In common speech, probability and likelihood are but Latin and Saxon doublets. In statistics the two terms, though often interchanged for the sake of variety or style, have distinct meanings. Probability is a property of the sample; likelihood is a property of the unknown parameter values.

Probability

Imagine that, in a model that described Nature's workings perfectly, the true values of the parameters $\alpha, \beta, \gamma, \ldots$ were such and such and that the true stochastic properties of the error term $u$ were such and such. We would then say that certain types of natural behavior (i.e., certain samples or observations) were more probable than others. For example, if you knew that a river flowed gently southward at a speed of 3 miles per hour, that an engineless boat drifting on it had such and such dimensions, weight, and friction (the model); if, in addition, you knew that gentle breezes usually blow in the area, very rarely faster than 5 miles per hour, and that they usually blow now in one, now in another direction (the stochastic properties); then you would be very much surprised to find an instance in which the boat had traveled 25 miles northward or 30 miles southward in the space of 2 hours (the improbable behavior).

Likelihood

Now reverse the position. If you were sure of your information about the wind, if you did not know which way or how fast the river flowed, but you observed the boat 28 miles south of where it was 2 hours ago and were willing to assume that Nature took the most probable action while you happened to be observing her, then you would infer that the river must have a southward current of 14 miles per hour. This is the maximum likelihood estimate, or most likely (NOT most probable) speed of the river on the evidence of this unfortunate sample.
Any other southward speed and any kind of northward flow are highly unlikely, or less likely than 14 miles per hour south. To say that any other speed is less probable is to misuse the term. The river's speed is what it is (3 miles per hour to the south) and it cannot be more or less probable. What can be more or less probable is the particular observation: that the boat has traveled southward 28 miles. This observation is very improbable if the river indeed flows 3 miles per hour southward. It would be more probable if the river flowed southward with a speed of 5, 7, or 10 miles. And it would be most probable if the true speed of the river had been 14 miles per hour.

Evidently, a maximum likelihood guess can be very far from the truth. All estimation in econometrics operates as in this river example, no matter how elaborate the model, sloppy or exquisite the sample.

What is so commendable about the maximum likelihood criterion, if it cannot guarantee us correct or even nearly correct results? Why assume that Nature will do the most likely thing? All I can say to this is to ask: Well, what shall we assume instead? the second most likely thing? the seventy-first?

It is true that (in some cases) maximum likelihood estimates tend to be correct estimates "on the average" or "in the long run" (see Secs. 2.1 and 2.10). These facts, however, are irrelevant, because we use the maximum likelihood criterion even when we plan neither to repeat the experiment nor to enlarge the sample.

It is very important to appreciate just what maximum likelihood estimation does: The experimenter makes one observation, say, that the boat had traveled 28 miles southward in 2 hours; he then asserts hopefully (he does not know this) that the wind has been calm, because this is the most typical total net wind speed for all conceivable 2-hour stretches; and so he lets his estimate of the speed be 14 miles.
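The river-and-boat reasoning can be traced step by step in a short sketch. Everything numerical here is an assumption made for illustration: the table `net_wind_prob` is an invented distribution of the wind's net contribution over 2 hours (calm being the most typical case), and the grid of candidate river speeds is arbitrary. The function name `likelihood` is hypothetical.

```python
# Hypothetical distribution of the wind's net contribution (miles southward)
# over a 2-hour stretch; calm (0) is the most typical case.
net_wind_prob = {-10: 0.08, -4: 0.20, 0: 0.50, 4: 0.20, 22: 0.02}

def likelihood(river_speed, distance, hours):
    """Probability of the observed distance if the river flowed at river_speed."""
    implied_wind = distance - river_speed * hours   # what the wind must have added
    return net_wind_prob.get(implied_wind, 0.0)

distance, hours = 28, 2                               # the single observation
candidates = [half / 2 for half in range(-10, 61)]    # -5.0, -4.5, ..., 30.0 mph

# The maximum likelihood estimate is the candidate speed that makes the
# observation most probable.
mle = max(candidates, key=lambda speed: likelihood(speed, distance, hours))

print(mle)                              # 14.0
print(likelihood(14, distance, hours))  # 0.5
print(likelihood(3, distance, hours))   # 0.02
```

Under these assumed figures the observation is most probable (0.5) when the river's speed is 14 miles per hour and quite improbable (0.02) at the true speed of 3. The code ranks river speeds by how probable they make the observation; it assigns no probability to the speeds themselves.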
Actually, we (who happen to know that the true speed is 3 miles) realize that, while the experimenter was busy measuring, the weather was not at all typical but happened to be the improbable case of 2 hours of strong southerly wind.

The same experimenter under different circumstances might estimate the speed to be 6, 0.5, -2, -3.0, etc., miles per hour depending on the wind's actual whim during the 2-hour interval in which observation took place.

Exercise

2.A Set up an econometric model of the river-and-boat example of Sec. 2.2, using the following symbols: $d_t$ for the number of miles (from a fixed point) traveled southward in $t$ hours by the boat, $\gamma$ for the (unknown) speed of the river in miles per hour, and $u_t$ for the net southbound component of the wind's speed in miles per hour. Let $u_t$ have the following stochastic specification:

  10 per cent of the time   $u_t = 11$   (southbound)
  70 per cent of the time   $u_t = 0$    (calm)
  10 per cent of the time   $u_t = -5$   (northbound)
  10 per cent of the time   $u_t = -6$   (northbound)

Construct a probability table giving the net wind effects for 2 hours in succession. For each type of conceivable observation, derive the maximum likelihood estimate of $\gamma$.

Digression on the multivariate normal distribution

The univariate normal distribution for a variable $u$ with universe mean $\varepsilon u$ and variance $\sigma_{uu}$ was written in Sec. 1.10 in the fancy form

$$p(u) = (2\pi)^{-1/2} \det(\sigma_{uu})^{-1/2} \exp\left[-\tfrac{1}{2}(u - \varepsilon u)(\sigma_{uu})^{-1}(u - \varepsilon u)\right] \tag{2-1}$$

because of the ease with which it generalizes to the multivariate case. Let $u_1, u_2, \ldots, u_N$ be $N$ variables which have a joint normal distribution. Define

$$\mathbf{u} = \operatorname{vec}(u_1, u_2, \ldots, u_N) \qquad \varepsilon\mathbf{u} = \operatorname{vec}(\varepsilon u_1, \varepsilon u_2, \ldots, \varepsilon u_N)$$

$$\boldsymbol{\sigma}_{uu} = \begin{bmatrix} \sigma_{u_1u_1} & \cdots & \sigma_{u_1u_N} \\ \vdots & & \vdots \\ \sigma_{u_Nu_1} & \cdots & \sigma_{u_Nu_N} \end{bmatrix}$$

For $\sigma_{u_1u_2}$ we often write $\operatorname{cov}(u_1,u_2)$, or simply $\sigma_{12}$ if the meaning is clear from the context. Sometimes the inverse of $\boldsymbol{\sigma}_{uu}$, usually written $(\boldsymbol{\sigma}_{uu})^{-1}$, is written $\boldsymbol{\sigma}^{uu}$, and its elements are written $\sigma^{u_iu_j}$ or just $\sigma^{ij}$. These superscripts are not exponents.
If we need to write an exponent we write it outside parentheses, as in equation (2-1).

To get the multivariate distribution for $u_1, u_2, \ldots, u_N$, all we need to do is change the italic $u$'s of (2-1) into bold characters:

$$p(\mathbf{u}) = (2\pi)^{-N/2} \det(\boldsymbol{\sigma}_{uu})^{-1/2} \exp\left[-\tfrac{1}{2}(\mathbf{u} - \varepsilon\mathbf{u})(\boldsymbol{\sigma}_{uu})^{-1}(\mathbf{u} - \varepsilon\mathbf{u})\right] \tag{2-2}$$

This illustrates the principle noted in Sec. 1.2: that if an operation, theorem, property, etc., holds for simple numbers, it holds analogously for matrices. This is a great convenience, because you can pretend that matrices are numbers and so collapse a complicated formula into a shorter and more intuitive expression. Moreover, by pretending a matrix is a number, you can get a clear impression of what a formula conveys.

Exercises

2.B Write explicitly the joint normal distribution of the two variables $x$ and $w$.

2.C In Exercise 2.B, modify the formula for $\sigma_{xx} = \sigma_{ww}$ and $\sigma_{xw} = 0$.

2.D Write in vector and matrix notation the formula

$$-\tfrac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} (u_m - \varepsilon u_m)\,\sigma^{mn}\,(u_n - \varepsilon u_n)$$

2.3. The concept of likelihood function

Consider again the river-and-boat illustration of the previous section. Our information can come in one of several different ways.

1. A sample of one observation

Someone may have sighted the boat at the zero point at twelve o'clock and 28 miles south of that 2 hours later. This is one observation; it leads to $\hat\gamma = 28/2 = 14$ miles per hour southward as the maximum likelihood estimate of the river's speed. The number of hours elapsed from the beginning to the end of the observation could have been 1 or 1/2 or 7 or anything else.

2. A sample of several independent observations

We may have several observations like the above but made on different days. For instance,

  Observation   Time elapsed   Distance traveled
  1             2 hours        28 miles south
  2             4 hours        12 miles south
  3             17 hours       44 miles south

3. Several interdependent observations

Observations may overlap; as, for example,

  Observation   Time of observation   Distance traveled
  a             12 to 2 p.m.          28 miles south
  b             1 to 5 p.m.           20 miles south
                of the same day

Or information may come in even more complicated ways. The likelihood function can be constructed only if we know both the circumstances of our observations and the readings derived from them. Cases 1, 2, and 3 lead to different likelihood functions because the circumstances differ. Two observers, each of whom watched the boat for 2 consecutive hours unbeknownst to the other, would set up two likelihood functions identical in form into which they would feed different readings. But each investigator would set up one and only one likelihood function. This is a function of a single sample, the sample, his sample; no matter how independent, complicated, or interdependent his observations may be, they form a single sample.

The maximum likelihood criterion tells us to proceed as if Nature did the most probable thing. We assert this about the totality of observations in the sample rather than about any single observation.

2.4. The form of the likelihood function

Return to the consumption model $C_t = \alpha + \gamma Z_t + u_t$. The following statement must be accepted on faith (its proof is a deepish theorem in analysis): Under the assumptions that Nature conforms to the model and that the true values of the parameters are $\alpha$ and $\gamma$, the probability of observing the particular sample $C_1, C_2, \ldots, C_S$, $Z_1, Z_2, \ldots, Z_S$ is equal to the probability that the error term shall have assumed the particular values $u_1, u_2, \ldots, u_S$ multiplied by a factor $\det J$. The term $\det J$ happens to be equal to 1 in all single-equation cases; so we need not worry about it yet. It becomes important in two-or-more-equation models.

The statement cited above is of immense and curious significance. We observe the sample $C_s$, $Z_s$.
But we cannot know (directly) how probable or improbable it is to obtain this particular sample, since all our stochastic assumptions have to do with the probability distribution of the $u$'s, not of the $C$'s and $Z$'s. On the other hand, we can never observe the random errors themselves. So one might despair of finding the probability of this particular sample but for the remarkable property cited.

Let $L$ stand for the probability of the sample and $q$ for the probability that the random term will take on the values $u_1, u_2, \ldots, u_S$. Then we have

$$L = \det J \cdot q(u_1, u_2, \ldots, u_S) \tag{2-3}$$

Now, the (unobservable) $u$'s are functions of the (observed) $C$'s and $Z$'s and of (the unknown) $\alpha$ and $\gamma$, because the model implies $u_t = C_t - \alpha - \gamma Z_t$. To maximize likelihood is to seek the pair of values of $\alpha$ and $\gamma$ that makes $L$ as large as possible.

What form $q(u_1, \ldots, u_S)$ takes depends on the stochastic assumptions about the error term.

This concludes the discussion of the logic behind maximum likelihood estimating of $\alpha$ and $\gamma$. In the next few pages I discuss the mechanism of maximizing $L$ under the Six Simplifying Assumptions. On a first reading, you might skip the rest of this section without serious loss. Readers who wish to refresh their manipulative skills, read on!

We shall now omit writing $\det J = 1$, since we are discussing only single-equation cases at this point. By Simplifying Assumption 4, the random terms $u_1, u_2, \ldots, u_S$ come from a multivariate normal distribution. Therefore (2-2) applies, and

$$L = q(u_1, u_2, \ldots, u_S) = (2\pi)^{-S/2} \det(\boldsymbol{\sigma}_{uu})^{-1/2} \exp\left[-\tfrac{1}{2}(\mathbf{u} - \varepsilon\mathbf{u})(\boldsymbol{\sigma}_{uu})^{-1}(\mathbf{u} - \varepsilon\mathbf{u})\right] \tag{2-4}$$

By Simplifying Assumption 2, $\varepsilon u_t = 0$.
Simplifying Assumption 3 states that all diagonal elements of $\boldsymbol{\sigma}_{uu}$ are equal to a finite constant $\sigma_{uu}$, and Assumption 5 states that all nondiagonal elements are zero; so $\det(\boldsymbol{\sigma}_{uu})^{-1/2} = (\sigma_{uu})^{-S/2}$. Therefore, (2-4) reduces to

$$L = (2\pi)^{-S/2} (\sigma_{uu})^{-S/2} \exp\left[-\tfrac{1}{2}\,\mathbf{u}\,(\sigma_{uu}\mathbf{I})^{-1}\,\mathbf{u}\right]$$

and finally to

$$L = (2\pi)^{-S/2} (\sigma_{uu})^{-S/2} \exp\left[-\tfrac{1}{2}(\sigma_{uu})^{-1} \sum_{s=1}^{S} u_s^2\right] \tag{2-5}$$

The following properties of $L$ will not be proved:

1. $L$ is a continuous function with respect to $\alpha$, $\gamma$, $\sigma_{uu}$ except at $\sigma_{uu} = 0$. This means that it can be differentiated quite safely. As for the exception, we need not worry about it; for $u$, as a random variable, assumes at least two distinct values in the universe, and therefore $\sigma_{uu} > 0$. If the sample is of only two observations, the fit is perfect and $m_{\hat u\hat u}$ is zero; but in that case we do not use the likelihood approach at all. We just solve the two equations $C_1 = \alpha + \gamma Z_1$ and $C_2 = \alpha + \gamma Z_2$ for the two unknowns $\alpha$ and $\gamma$.

2. Setting the partial derivatives of $L$ equal to zero locates its maxima. It has no minimum; therefore, we do not need to worry about second-order conditions of maximization.

3. $L$ is a maximum when its logarithm is a maximum. So, instead of (2-5), we maximize the more convenient expression

$$\log L = -\frac{S}{2}\log 2\pi - \frac{S}{2}\log \sigma_{uu} - \frac{1}{2}(\sigma_{uu})^{-1} \sum_{s=1}^{S} u_s^2 \tag{2-6}$$

4. The true values of $\alpha$, $\gamma$, and $\sigma_{uu}$ are not functions of one another, but constants. Therefore, in maximizing, all partial derivatives of $\alpha$, $\gamma$, and $\sigma_{uu}$ with respect to one another are zero.

Maximizing (2-6) results in

$$\begin{aligned} \sum_s (C_s - \alpha - \gamma Z_s) &= 0 \\ \sum_s (C_s - \alpha - \gamma Z_s) Z_s &= 0 \\ \frac{1}{S} \sum_s (C_s - \alpha - \gamma Z_s)^2 &= \sigma_{uu} \end{aligned} \tag{2-7}$$

The solution of (2-7) for $\alpha$, $\gamma$, $\sigma_{uu}$ gives the maximum likelihood estimates $\hat\alpha$, $\hat\gamma$, $\hat\sigma_{uu}$.

2.5. Justification of the least squares technique

It is evident that (2-6) gives least squares estimates for $\alpha$, $\gamma$, $\sigma_{uu}$. System (2-7) says that the maximum likelihood values of $\alpha$ and $\gamma$ are the values that minimize the sum of the squares of the residuals $\hat u_s$.
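
The claim that the solution of (2-7) maximizes (2-6) can be checked numerically. The following is a minimal sketch, not part of the original text: the five observations are invented for illustration, and the names `log_likelihood`, `gamma_hat`, `alpha_hat`, and `sigma_hat` are hypothetical. It solves (2-7) in closed form and then confirms that moving any one parameter away from the solution lowers $\log L$.

```python
import math

# Hypothetical sample (Z_s, C_s); the numbers are invented for illustration.
Z = [10.0, 12.0, 15.0, 19.0, 24.0]
C = [8.1, 9.0, 10.2, 12.1, 13.9]
S = len(Z)


def log_likelihood(alpha, gamma, sigma_uu):
    """Equation (2-6) evaluated at trial parameter values."""
    ss = sum((c - alpha - gamma * z) ** 2 for z, c in zip(Z, C))
    return (-S / 2 * math.log(2 * math.pi)
            - S / 2 * math.log(sigma_uu)
            - ss / (2 * sigma_uu))


# First two equations of (2-7), solved for gamma and alpha.
sum_z, sum_c = sum(Z), sum(C)
sum_zz = sum(z * z for z in Z)
sum_zc = sum(z * c for z, c in zip(Z, C))
gamma_hat = (S * sum_zc - sum_z * sum_c) / (S * sum_zz - sum_z ** 2)
alpha_hat = (sum_c - gamma_hat * sum_z) / S

# Third equation of (2-7): the average squared residual.
sigma_hat = sum((c - alpha_hat - gamma_hat * z) ** 2
                for z, c in zip(Z, C)) / S

best = log_likelihood(alpha_hat, gamma_hat, sigma_hat)

# Perturb one parameter at a time; each move should lower log L.
trials = [
    (alpha_hat + 0.1, gamma_hat, sigma_hat),
    (alpha_hat - 0.1, gamma_hat, sigma_hat),
    (alpha_hat, gamma_hat + 0.01, sigma_hat),
    (alpha_hat, gamma_hat - 0.01, sigma_hat),
    (alpha_hat, gamma_hat, sigma_hat * 1.2),
    (alpha_hat, gamma_hat, sigma_hat * 0.8),
]
all_lower = all(log_likelihood(*t) < best for t in trials)
print(alpha_hat, gamma_hat, sigma_hat, all_lower)
```

A handful of perturbations is of course not a proof; it merely illustrates that the least squares values and the average squared residual sit at the top of the likelihood surface for this sample.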
The last equation in (2-7) states that the maximum likelihood estimate $\hat\sigma_{uu}$ of the true variance $\sigma_{uu}$ is the average square residual.

This, then, is the justification for minimizing squares. Remember that to get this result we had to make use of a great many assumptions both about the model itself and about the nature of its error term. If any one of these many assumptions had not been granted, we might not have reached this result. Therefore, one should not go about minimizing squares too lightheartedly. For every different set of assumptions a certain estimating procedure is best, and least squares is best only with a proper combination of assumptions. Conversely, every estimating procedure contains in itself (implicitly) some assumptions either about the model, or about the distribution of $u$, or both.¹

Digression on computational arrangement

It pays to develop a tidy scheme for computing $\hat\alpha$, $\hat\gamma$, and $\hat\sigma_{uu}$ because computation recipes similar to (2-7) turn up pretty often. It is always possible to arrange the computations in such a way as to estimate the coefficient $\gamma$ of the independent variable first. With $\hat\gamma$ in hand, one computes the constant term $\hat\alpha$. Finally, with $\hat\alpha$ and $\hat\gamma$, one computes the residuals $\hat u_s$, and from these residuals, an estimate of $\sigma_{uu}$.

An analogous procedure for models having several independent variables (and, hence, several $\gamma$s) is developed in the next digression. In all cases, that is to say, for simple as well as for complicated models, I shall describe only the computational steps for estimating the $\gamma$s (coefficients of the independent variables).

Write (2-7) as follows:

$$\begin{aligned} \alpha S + \gamma \sum Z_s &= \sum C_s \\ \alpha \sum Z_s + \gamma \sum Z_s^2 &= \sum C_s Z_s \end{aligned}$$

where the sums run over the entire sample. Now subtract $\sum Z_s$ times the first equation from $S$ times the second.
The result is

$$\gamma\left[S \sum Z^2 - \left(\sum Z\right)\left(\sum Z\right)\right] = \left[S \sum CZ - \left(\sum C\right)\left(\sum Z\right)\right] \tag{2-8}$$

Note that we have eliminated $\alpha$ and that, moreover, in the square brackets we may recognize the familiar moments, defined in Chap. 1. Thus (2-8) is equivalent to

$$\gamma\, m_{zz} = m_{zc}$$

and the estimate of $\gamma$ can be expressed very simply as

$$\hat\gamma = (m_{zz})^{-1} m_{zc} \quad\text{or}\quad \hat\gamma = \frac{m_{zc}}{m_{zz}} \tag{2-9}$$

which, besides being compact, generalizes easily to $N$ dimensions, i.e., by replacing the Greek and italic letters by the corresponding characters in boldface type:

$$\hat{\boldsymbol{\gamma}} = (\mathbf{m}_{zz})^{-1} \mathbf{m}_{zc} \tag{2-9a}$$

¹ The Six Simplifying Assumptions are sufficient but not necessary conditions for least squares. Least squares is a "best linear unbiased estimator" under much simpler conditions. This, however, is another subject. I chose these particular six assumptions because with them it is easy to show how a stochastic specification and an estimating criterion lead to a specific estimate of a parameter rather than to some other estimate.

2.6. Generalized least squares

All the principles discussed so far apply to all linear models consisting of a single equation. To treat the general case, we shall make a slight change in notation: $y$ will stand for any endogenous variable (the role played by consumption $C$ so far) and $z$ for any exogenous variable (the role played by lagged income $Z$ so far).

Let us suppose that the endogenous variable $y(t)$ depends on $H$ different predetermined variables $z_1(t), z_2(t), \ldots, z_H(t)$ as follows (omitting time indexes):

$$y = \alpha + \gamma_1 z_1 + \gamma_2 z_2 + \cdots + \gamma_H z_H + u \tag{2-10}$$

Indeed, the analogy of (2-10) with $C = \alpha + \gamma Z + u$ is so perfect that everything said about the latter applies to the shorthand edition of (2-10):

$$y = \alpha + \boldsymbol{\gamma}\mathbf{z} + u$$

where $\boldsymbol{\gamma}$ is the vector $(\gamma_1, \gamma_2, \ldots, \gamma_H)$ and $\mathbf{z}$ is the vector¹ $(z_1, z_2, \ldots, z_H)$. But we must be careful.
The first five Simplifying Assumptions ($u$ is random, has mean zero, has constant variance, is normal, and is serially independent) need no alteration. Assumption 6 must, however, be changed to read as follows: The error term $u_t$ is fully independent of $z_1(t), z_2(t), \ldots, z_H(t)$.

¹ For typographical simplicity I shall not bother, in obvious cases, to distinguish a column from a row vector. In this case $\mathbf{z}$ is a column vector.

Under the new version of the Simplifying Assumptions, the maximum likelihood criterion leads to the estimate of $\gamma_1, \gamma_2, \ldots, \gamma_H$ that minimizes the sum of the squared residuals. And, moreover, these estimates are given by the formula

$$\hat{\boldsymbol{\gamma}} = (\mathbf{m}_{zz})^{-1} \mathbf{m}_{zy} \tag{2-11}$$

which is exactly analogous to (2-9). What these boldface symbols mean is explained in the next digression.

Digression on matrices of moments and their determinants

This is a natural place to introduce some extremely convenient notation, which we shall be using from Chap. 6 on.

If $p$, $q$, $r$, $x$, $y$ are variables, $\mathbf{m}_{(p,q,r)\cdot(x,y)}$ is the matrix whose elements are moments that can be constructed with $p$, $q$, $r$ on $x$ and $y$. The variables in the first parentheses correspond to the rows and those in the second to the columns. Thus,

$$\mathbf{m}_{(p,q,r)\cdot(x,y)} = \begin{bmatrix} m_{px} & m_{py} \\ m_{qx} & m_{qy} \\ m_{rx} & m_{ry} \end{bmatrix}$$

The middle dot in the subscript may be omitted. Likewise, $\mathbf{m}_{zz}$ means the matrix whose elements are moments of the variables $z_1, z_2, \ldots, z_H$ on themselves:

$$\mathbf{m}_{zz} = \begin{bmatrix} m_{z_1z_1} & \cdots & m_{z_1z_H} \\ \vdots & & \vdots \\ m_{z_Hz_1} & \cdots & m_{z_Hz_H} \end{bmatrix}$$

and $\mathbf{m}_{zy}$ means

$$\mathbf{m}_{zy} = \begin{bmatrix} m_{z_1y} \\ m_{z_2y} \\ \vdots \\ m_{z_Hy} \end{bmatrix}$$

Every square matrix has a determinant. So does every square matrix of moments, for instance, $\mathbf{m}_{zz}$; for the determinant of $\mathbf{m}$ we write $\det \mathbf{m}$, perhaps with the appropriate subscripts: $\det \mathbf{m}_{zz}$, or $\det \mathbf{m}_{(z_1,z_2,\ldots,z_H)(z_1,z_2,\ldots,z_H)}$. But it is simpler to write $m_{zz}$ instead of $\det \mathbf{m}$ or $\det \mathbf{m}_{zz}$; and we shall do this for compactness.
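
As a small numerical illustration of (2-11) and of the moment matrices just defined, here is a sketch for $H = 2$. Everything in it is an assumption made for illustration: the data are invented, the helper names are hypothetical, and the moment is computed as $S\sum xw - \sum x \sum w$, the quantity appearing in the brackets of (2-8). Chapter 1 may normalize moments differently, but any common factor cancels in (2-11).

```python
# Hypothetical data: one endogenous variable y and H = 2 exogenous variables.
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
z1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]


def moment(x, w):
    """Moment of x on w, computed as S*sum(xw) - sum(x)*sum(w).

    Assumption: this matches the bracketed quantities in (2-8); a different
    normalization would cancel out of (2-11).
    """
    s = len(x)
    return s * sum(a * b for a, b in zip(x, w)) - sum(x) * sum(w)


def moment_matrix(rows, cols):
    """Matrix m_(rows).(cols): first list indexes rows, second the columns."""
    return [[moment(r, c) for c in cols] for r in rows]


def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix (adequate for H = 2 only)."""
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]


m_zz = moment_matrix([z1, z2], [z1, z2])
m_zy = [moment(z1, y), moment(z2, y)]

# Equation (2-11): gamma_hat = (m_zz)^(-1) m_zy.
inv = inverse_2x2(m_zz)
gamma_hat = [inv[0][0] * m_zy[0] + inv[0][1] * m_zy[1],
             inv[1][0] * m_zy[0] + inv[1][1] * m_zy[1]]

# The constant term follows once the gammas are in hand.
s = len(y)
alpha_hat = (sum(y) - gamma_hat[0] * sum(z1) - gamma_hat[1] * sum(z2)) / s

residuals = [yi - alpha_hat - gamma_hat[0] * a - gamma_hat[1] * b
             for yi, a, b in zip(y, z1, z2)]
print(gamma_hat, alpha_hat)
```

The residuals produced this way sum to zero and are uncorrelated with each $z$, which is what the normal equations require.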
The lightface italic $m$ in the expression $m_{zz}$ indicates that the determinant is a simple number, like 2 or 16.17, and neither a vector nor a matrix of numbers (these are printed bold).

One way to estimate the coefficients $\gamma_1, \gamma_2, \ldots, \gamma_H$ is to perform the matrix operations given in (2-11). Another way is by Cramer's rule, which calculates various determinants and computes

$$\begin{aligned} \hat\gamma_1 &= \frac{m_{(y,z_2,\ldots,z_H)(z_1,z_2,\ldots,z_H)}}{m_{(z_1,z_2,\ldots,z_H)(z_1,z_2,\ldots,z_H)}} \\ \hat\gamma_2 &= \frac{m_{(z_1,y,\ldots,z_H)(z_1,z_2,\ldots,z_H)}}{m_{(z_1,z_2,\ldots,z_H)(z_1,z_2,\ldots,z_H)}} \\ &\;\;\vdots \\ \hat\gamma_H &= \frac{m_{(z_1,z_2,\ldots,y)(z_1,z_2,\ldots,z_H)}}{m_{(z_1,z_2,\ldots,z_H)(z_1,z_2,\ldots,z_H)}} \end{aligned} \tag{2-12}$$

Both these ways are very cumbersome in practice for equations with more than three or four variables, unless we have ready programs on electronic computers. Appendix B gives a stepwise technique for calculating $\hat\gamma_1, \hat\gamma_2, \ldots, \hat\gamma_H$ that can be used on an ordinary desk calculator. Matrix inversion is discussed in Appendix A.

2.7. The meaning of unbiasedness

Let us discuss bias and unbiasedness by using the original model of consumption $C_t = \alpha + \gamma Z_t + u_t$ with the understanding that all conclusions hold true for the generalized single-equation model $y = \alpha + \gamma_1 z_1 + \cdots + \gamma_H z_H + u$. Furthermore, we can restrict ourselves, with some exceptions, to the discussion of $\gamma$, because the statements to be made are also true of $\alpha$.

Imagine that we obtain our guess $\hat\gamma$ of the parameter $\gamma$, violating none of the Simplifying Assumptions. The guess so chosen is the most likely in the circumstances. But this does not guarantee it to be equal to the true value $\gamma$. This is so because the observations $C_s$, $Z_s$ we have to go on are just a sample. And in sampling anything can happen. Extremely atypical misleading samples are improbable but perfectly possible. So it makes sense to ask how far off the guess $\hat\gamma$ is likely to be from the true value $\gamma$.

Here it is very important to distinguish between (1) taking again and again a sample of size $S$, (2) taking bigger and bigger
samples (one of each size). The first procedure is connected with the important statistical notion of bias, the second with that of consistency. Both procedures are ideal and impractical, because such samples must be taken from the universe (level III) and not merely from the population (level II). Therefore, even with infinite resources and infinite patience, the concepts are not operational.

Consider any estimating recipe (say, least squares). Choose a sample size, say, $S = 20$; draw (from the universe) all possible samples of size 20; for each sample compute (by least squares) the corresponding $\hat\gamma$; then average out these $\hat\gamma$s. If the average $\hat\gamma$ equals the true $\gamma$, then we say that the procedure of least squares is an unbiased method for estimating $\gamma$, or an unbiased estimator of $\gamma$, for sample size $S = 20$. Loosely, we might say that on the average, least squares gives a correct estimate of $\gamma$ from samples of 20 observations.

In order to pin down firmly the concept of bias, I have constructed a purposely simple and exaggerated example. It involves just three time periods, very uneven disturbances from time period to time period, and a random disturbance that assumes just three different values. Yet this example illustrates all that could be shown with a larger and more realistic one.

Assume that the true values of the parameters we seek to estimate are $\alpha = 4$, $\gamma = 0.4$. Assume that the population consists of exactly 3 elements, labeled $a$, $b$, $c$, whose coordinates are given in Table 2.1 below; the three points are shown in Fig. 2. This population could have come from an infinite universe, but let us (for pedagogic reasons) deal with a finite universe that consists of the above three points $a$, $b$, $c$ plus four more, which are named $a'$, $b'$, $c'$, and $c''$. Every point of the universe is completely defined when we specify the random error $u$ that corresponds to it and the level of the independent variable. These are given in Table 2.2.

[Fig. 2. A seven-point universe, plotted against income $Z_t$ together with the true (exact) relation $C_t = 4 + 0.4Z_t$; the points $c'$ and $c''$ coincide. Solid dots: points in the population. Hollow dots: points in the universe but not in the population.]

Table 2.1 The population (a,b,c)

  Point   Time t   Z_t    C_t     u_t
  a       1         0     4.05     0.05
  b       2         4     6.00     0.40
  c       3        14     0.60    -9.00

Table 2.2

          Points in the population         Points in the universe but
                                           not in the population
  Time    Name   Value of u_t   Value of Z_t    Name   Value of u_t   Value of Z_t
  1       a       0.05           0              a'     -0.05           0
  2       b       0.40           4              b'     -0.40           4
  3       c      -9.00          14              c'     +4.50          14
  3       ...                                   c''    +4.50          14

Exercise

2.E Which Simplifying Assumptions are fulfilled by $u_t$ in the universe of Table 2.2, and which are violated?

Now let us see if the least squares method is an unbiased estimator of $\gamma$. First let us take all conceivable samples of size 2 and for each compute the least squares value $\hat\gamma$. Samples should be taken in such a way that the same time period is not represented more than once. The population can yield only the following pairs: $(a,b)$, $(a,c)$, and $(b,c)$. This is the most that a flesh-and-blood statistician, even one equipped with unlimited means, could obtain operationally, because points $a'$, $b'$, $c'$, and $c''$ exist, so to speak, only in the mind of God. But the definition of bias requires us to check samples (of size 2) that include all points of the universe, human and divine alike. There are sixteen such samples, and the corresponding estimates $\hat\alpha$ and $\hat\gamma$ are given in Table 2.3 and plotted in Fig. 3. When all sixteen are considered, it is seen that least squares is an unbiased estimator of $\gamma$ (and of $\alpha$).

Table 2.3 Estimates of $\alpha$ and $\gamma$ from samples of size 2

  Points in       Corresponding estimate
  the sample      of gamma     of alpha
  a  b             0.4875      4.0500
  a  b'            0.2875      4.0500
  a  c            -0.2464      4.0498
  a  c'            0.7179      4.0497
  a  c''           0.7179      4.0497
  a' b             0.5125      3.9500
  a' b'            0.3125      3.9500
  a' c            -0.2393      3.9501
  a' c'            0.7250      3.9500
  a' c''           0.7250      3.9500
  b  c            -0.5400      8.1600
  b  c'            0.8100      2.7600
  b  c''           0.8100      2.7600
  b' c            -0.4600      7.0400
  b' c'            0.8900      1.6400
  b' c''           0.8900      1.6400

  Average of all conceivable samples                   0.4000      4.0000
  Average of all feasible samples (unprimed points)   -0.0997      5.4199

If we try all samples of size 3, we get the results tabulated in Table 2.4 and plotted in Fig. 4.

Table 2.4 Estimates of $\alpha$ and $\gamma$ from samples of size 3

  Points in
  the sample      alpha-hat    gamma-hat
  a  b  c         5.36728      -0.30288
  a  b  c'        3.63652       0.73558
  a  b  c''       3.63652       0.73558
  a  b' c         5.00833      -0.28750
  a  b' c'        3.27757       0.75096
  a  b' c''       3.27757       0.75096
  a' b  c         5.29939      -0.29712
  a' b  c'        3.56857       0.74135
  a' b  c''       3.56857       0.74135
  a' b' c         4.94038      -0.28173
  a' b' c'        3.20962       0.75673
  a' b' c''       3.20962       0.75673

  Average         4.0000        0.4000

For a sample of size 3, the least squares method is an unbiased estimator of both $\gamma$ and $\alpha$.

In certain cases, not illustrated by our simple example, (1) an estimating technique (say, least squares) may be unbiased for some sample sizes and biased for other sizes; (2) a method may overestimate $\gamma$ for certain sample sizes and underestimate it for others, on the average; (3) we may be able to tell a priori, knowing the sample size $S$, whether the bias is positive or negative (in other cases we cannot); (4) a method may be unbiased for one parameter but biased for another.

2.8. Variance of the estimate

In Fig. 3 I have plotted all the estimates of $\alpha$ and $\gamma$ for all possible samples of size 2. The same thing was done for size 3 in Fig. 4. In general, the estimates are scattered or clustered, depending (1) on the size $S$ of the sample, (2) on the size and other features of the universe, (3) on the particular estimating technique we have adopted, and ultimately (4) on the extent to which random effects dominate the systematic variables. Other things being equal, we prefer an estimating technique that yields clustered estimates.
The spread among the various estimates $\hat\gamma$ is called the variance of the estimate $\hat\gamma$, and is written $\sigma_{\hat\gamma\hat\gamma}$ or $\sigma(\hat\gamma,\hat\gamma)$, or, sometimes, $\sigma(\hat\gamma,\hat\gamma|S)$ if we want to emphasize what size sample it relates to. The variance is defined by

$$\sigma(\hat\gamma,\hat\gamma) = \varepsilon(\hat\gamma - \varepsilon\hat\gamma)^2$$

and is a constant, which exists and can be computed if the four items listed above are known. Table 2.5 gives the values of $\sigma$ for our seven-point example.

[Fig. 3. Parameter estimates from all samples of size 2, plotted with $\hat\alpha$ on the horizontal axis and $\hat\gamma$ on the vertical axis; the true point $(\alpha,\gamma) = (4, 0.4)$ is marked. A distinct symbol marks double points.]

[Fig. 4. Parameter estimates from all samples of size 3, plotted with $\hat\alpha$ on the horizontal axis and $\hat\gamma$ on the vertical axis; the true point $(\alpha,\gamma) = (4, 0.4)$ is marked. A distinct symbol marks double points.]

Note the interesting (and counterintuitive) fact that the variance of the estimate can increase as the sample size increases! This quirk arises because, in the example, the random disturbance has a skew distribution. If $u$ is symmetrical, the variance of the estimate decreases as the sample size increases.

Table 2.5

  Size S of samples    sigma(gamma-hat, gamma-hat)
  2                    0.2325
  3                    0.2397

2.9. Estimates of the variance of the estimate

If we have complete knowledge, we can compute the true value of $\sigma(\hat\gamma,\hat\gamma|S)$ by making a complete list of all samples of size $S$, computing all possible estimates of $\gamma$, and finding their variance, as I did in the above example. In practice, however, it is impossible to exhaust all samples of a given size, because the universe contains points that are not in the population. So, instead, we must be content with guessing at the variance of the estimate by the use of whatever information is contained in the single sample we have already drawn.
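
The "complete list of all samples" can be written out mechanically for the seven-point universe. The sketch below is an illustration, not the author's own procedure; the function names are hypothetical. It builds every sample of a given size from Table 2.2 (never using a time period twice), fits each by least squares, and reports the mean and variance of the resulting $\hat\gamma$s for comparison with Tables 2.3 to 2.5.

```python
from itertools import combinations, product

ALPHA, GAMMA = 4.0, 0.4  # true parameter values of the example

# Universe of Table 2.2: for each time period, the (u, Z) pairs that can occur.
UNIVERSE = {
    1: [(0.05, 0.0), (-0.05, 0.0)],                    # a, a'
    2: [(0.40, 4.0), (-0.40, 4.0)],                    # b, b'
    3: [(-9.00, 14.0), (4.50, 14.0), (4.50, 14.0)],    # c, c', c''
}


def least_squares(points):
    """Return (alpha_hat, gamma_hat) for a list of (Z, C) points."""
    s = len(points)
    sz = sum(z for z, _ in points)
    sc = sum(c for _, c in points)
    szz = sum(z * z for z, _ in points)
    szc = sum(z * c for z, c in points)
    gamma_hat = (s * szc - sz * sc) / (s * szz - sz * sz)
    alpha_hat = (sc - gamma_hat * sz) / s
    return alpha_hat, gamma_hat


def gamma_estimates(size):
    """gamma_hat for every sample of `size` points from distinct periods."""
    estimates = []
    for periods in combinations(sorted(UNIVERSE), size):
        for draw in product(*(UNIVERSE[t] for t in periods)):
            points = [(z, ALPHA + GAMMA * z + u) for u, z in draw]
            estimates.append(least_squares(points)[1])
    return estimates


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


for size in (2, 3):
    g = gamma_estimates(size)
    print(size, len(g), round(mean(g), 4), round(variance(g), 4))
```

The averages come out at the true value 0.4 for both sample sizes, and the variances agree with Table 2.5 to four decimal places.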
At first, you might suppose that estimating $\sigma(\hat\gamma,\hat\gamma|S)$ is logically impossible when you have a single sample of size $S$ to work with, because, after all, the variance of the estimate of $\gamma$ represents what happens to $\hat\gamma$ as you take all samples of size $S$.

All is not lost, however, because a single sample of size $S$ contains several samples ($S$ of them) each of size $S$-minus-1. The latter we can generate by leaving out, one at a time, each observation of the original sample. Thus, if the original sample is $(a,b,c)$ of size $S = 3$, it contains three subsamples of size 2 each, the following ones: $(a,b)$, $(a,c)$, and $(b,c)$, which yield, respectively, the three estimates $\hat\gamma(a,b)$, $\hat\gamma(a,c)$, and $\hat\gamma(b,c)$. We get, then, some idea about variations in the estimate of $\gamma$ among samples of size 2. Still, we know nothing about the variance of $\hat\gamma$ as estimated from samples of size 3. Here we invoke the maximum likelihood criterion. The original sample $(a,b,c)$ was assumed to be the most probable of its kind, namely, the family of samples containing three observations each. If this is so, then observations $a$, $b$, $c$ generate the most probable triplet $T = \{(a,b),(a,c),(b,c)\}$ of samples containing two observations each. Therefore, the variability of $\hat\gamma$ (in the triplet $T$) estimates its variability in samples of size 3.

From Table 2.3,

  gamma-hat(a,b) =  0.4875
  gamma-hat(a,c) = -0.2464
  gamma-hat(b,c) = -0.5400
  Average        = -0.0997

The variance of $\hat\gamma$ in the sample triplet is equal to

$$\tfrac{1}{3}\left[(0.4875 + 0.0997)^2 + (-0.2464 + 0.0997)^2 + (-0.5400 + 0.0997)^2\right] = 0.1867$$

The last figure must now be corrected by the factor $S$-minus-1 $= 2$ if it is to be an unbiased estimate of the variance of $\hat\gamma(a,b,c)$. Estimates of variance based on averages, if uncorrected, naturally understate the variance. The proof that $\hat\sigma = 0.1867 \times 2 = 0.3734$ is an unbiased estimate of $\sigma(\hat\gamma,\hat\gamma|3)$ is in Appendix C.

In practice we are too lazy to estimate $\gamma$ again and again for all the subsamples.
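
The leave-one-out arithmetic above is easy to reproduce. The following is a minimal sketch for the sample $(a,b,c)$ of Table 2.1; the variable names are hypothetical, and the code carries full precision rather than the four-decimal figures used in the text, so the last digit of the corrected value can differ slightly from 0.3734.

```python
from itertools import combinations

# The population of Table 2.1 as (Z, C) pairs.
SAMPLE = {"a": (0.0, 4.05), "b": (4.0, 6.0), "c": (14.0, 0.6)}


def slope(p, q):
    """Least squares gamma_hat from two points: the slope of the line joining them."""
    (z1, c1), (z2, c2) = p, q
    return (c2 - c1) / (z2 - z1)


# One estimate per subsample obtained by leaving out a single observation.
sub_estimates = [slope(SAMPLE[i], SAMPLE[j]) for i, j in combinations("abc", 2)]

average = sum(sub_estimates) / len(sub_estimates)
spread = sum((g - average) ** 2 for g in sub_estimates) / len(sub_estimates)

# Correction by the factor S - 1, as described in the text.
corrected = spread * (len(SAMPLE) - 1)

print([round(g, 4) for g in sub_estimates])
print(round(average, 4), round(spread, 4), corrected)
```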
The formula $\hat\sigma(\hat\gamma,\hat\gamma|S) = m_{\hat u\hat u}/(S - 1)m_{zz}$ gives a short-cut (and biased) estimate of the variance of $\hat\gamma$ for samples of the original size. Table 2.6 lists these estimates for three-point samples and repeats some of the information from Table 2.4.

Table 2.6 Estimates of $\gamma$ and of the variance of its estimates

  Points in                    sigma-hat(gamma-hat, gamma-hat)
  the sample      gamma-hat    = m_uu / ((S - 1) m_zz)
  a  b  c         -0.30288     0.02605
  a  b  c'         0.73558     0.00256
  a  b  c''        0.73558     0.00256
  a  b' c         -0.28750     0.01377
  a  b' c'         0.75096     0.00895
  a  b' c''        0.75096     0.00895
  a' b  c         -0.29712     0.02731
  a' b  c'         0.74135     0.00217
  a' b  c''        0.74135     0.00217
  a' b' c         -0.28173     0.01377
  a' b' c'         0.75673     0.00822
  a' b' c''        0.75673     0.00822

Table 2.6 must be interpreted carefully. To begin with, the investigator will usually know only its first line, because he has a single sample to work with. The remaining lines are put in Table 2.6, for pedagogic reasons, by the omniscient being who can consider all possible worlds. Events could have followed one or another course (and only one) among the courses listed in the several lines of Table 2.6. It just happened that $(a,b,c)$ materialized and not some other triplet. It yielded the two estimates $\hat\gamma = -0.30288$, a very wrong estimate, and $\hat\sigma = 0.02605$. The latter misleads us to believe in the likelihood of the former.

If sample $(a,b,c')$ had materialized, the two guesses would have been $\hat\gamma = 0.73558$ (not so bad as before) and $\hat\sigma = 0.00256$, which is ten times as "confident" as before. It is entirely possible for a sample to give a very wrong parameter estimate with a great deal of confidence. The mere fact that $\hat\sigma(\hat\gamma,\hat\gamma)$ is small does not make $\hat\gamma$ a good guess.

It is comforting, of course, to have some measure of how much $\hat\gamma$ varies from sample to sample. What is upsetting is that the measure is itself a guess.
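
The short-cut formula can be tried on two of the lines of Table 2.6. This is a sketch under an assumption: the moments $m_{\hat u\hat u}$ and $m_{zz}$ are computed here as sums of products of deviations from the sample means, with the same normalization in numerator and denominator so that any common factor cancels. The function name is hypothetical.

```python
def shortcut_variance(points):
    """Return (gamma_hat, m_uu / ((S - 1) * m_zz)) for a list of (Z, C) points.

    Assumption: moments are sums of products of deviations from the means.
    """
    s = len(points)
    z_bar = sum(z for z, _ in points) / s
    c_bar = sum(c for _, c in points) / s
    m_zz = sum((z - z_bar) ** 2 for z, _ in points)
    m_zc = sum((z - z_bar) * (c - c_bar) for z, c in points)
    gamma_hat = m_zc / m_zz
    alpha_hat = c_bar - gamma_hat * z_bar
    # Moment of the least squares residuals on themselves.
    m_uu = sum((c - alpha_hat - gamma_hat * z) ** 2 for z, c in points)
    return gamma_hat, m_uu / ((s - 1) * m_zz)


# Sample (a, b, c) from Table 2.1, and (a, b, c') with C = 4 + 0.4*14 + 4.5.
sample_abc = [(0.0, 4.05), (4.0, 6.0), (14.0, 0.6)]
sample_abc_prime = [(0.0, 4.05), (4.0, 6.0), (14.0, 14.1)]

g1, v1 = shortcut_variance(sample_abc)
g2, v2 = shortcut_variance(sample_abc_prime)
print(g1, v1)
print(g2, v2)
```

Both samples give a small estimated variance even though both slopes are far from the true 0.4, which is the point the text is making.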
True, it is better than nothing, but this is no consolation if by some quirk of fate we have picked a sample so atypical that it gives us not only a really wrong parameter estimate $\hat\gamma$, but also a really small $\hat\sigma(\hat\gamma,\hat\gamma)$.

The moral is: Don't be cocksure about the excellence of your guess of $\gamma$ just because you have guessed that its variance $\sigma(\hat\gamma,\hat\gamma)$ is small.

2.10. Estimates ad nauseam

Note carefully now that, whereas $\sigma(\hat\gamma,\hat\gamma)$ is a constant, $\hat\sigma(\hat\gamma,\hat\gamma)$ is not, but varies with each sample of the given size. Therefore $\hat\sigma(\hat\gamma,\hat\gamma)$ itself has a variance, which we may denote by $\sigma(\hat\sigma(\hat\gamma,\hat\gamma))$; this is a true constant. Now there is nothing to prevent us from making a guess at the latter on the basis of our sample, and this guess would be symbolized by $\hat\sigma(\hat\sigma(\hat\gamma,\hat\gamma))$, which is no longer a constant but varies with each sample, and so has a true variance $\sigma(\hat\sigma(\hat\sigma(\hat\gamma,\hat\gamma)))$, and so on, ad infinitum.

In other words, we cannot get away from the fact that, if all we can do about $\gamma$ is to guess that it equals $\hat\gamma$, then all we can do about its variance $\sigma(\hat\gamma,\hat\gamma)$ is to guess it too; likewise all we can do about this last guess is to guess again about its true variance, and so on forever. Guess we must, stage after stage, unless we have some outside knowledge. Only with outside knowledge can the guessing game stop.

The game is rarely played, however, beyond $\hat\sigma(\hat\gamma,\hat\gamma)$, (1) because it is quite tedious, and (2) because large enough samples give good $\hat\gamma$s and $\hat\sigma$s with high probability.

2.11. The meaning of consistency

As in our explanation of unbiasedness, let us discuss the parameter $\gamma$ of the model $C_t = \alpha + \gamma Z_t + u_t$ with the understanding that all conclusions generalize to all the parameters in the model

$$y = \alpha + \gamma_1 z_1 + \cdots + \gamma_H z_H + u$$

Consider any estimating recipe, say, least squares or least cubes. Choose a sample of a given size, say, $S = 20$, and compute $\hat\gamma$. Then choose another sample containing one more observation ($S = 21$) and compute its $\hat\gamma$.
Keep doing this, always increasing the sample's size. The bigger samples do not have to include any elements of the smaller samples — though this becomes inevitable as the big samples grow, if the universe is finite. 1 If, as the size of the sample grows, the estimates f improve, then we say that the least squares procedure is a consistent estimator of 7. Note that 1 does not have to improve in each and every step of this process of increasing the size of the sample. Improvement in the above paragraph means that the probability distributions of 1(S), 1(S -f 1), . . . become more and more pinched as the} r straddle the true value of the parameter. Digression on notation There are two variant notations for consistency. Let y(s) be the consistent estimator from a sample of s observations. Let e and Tj be two positive numbers, however small. Then there is,, some size S for which P(\y(s) - 7I < > 1 - v if s > S. A shorthand notation for the same thing is P lim y(s) = 7 which is to be read "y(s) converges to 7 in probability/' or "probability limit of y(s) is 7." Under very weak restrictions, a maximum likelihood estimate is also a consistent estimate. Note, however, that, even when the method is consistent, there is no guarantee that the estimate will improve every time we take a larger sample. It may turn out that our sample of size 2 happens to contain points a and b, which give an estimate f(2) = 0.4875, and the next larger sample happens to contain points 1 A sample could, of course, bo infinite without ever including all the elements of an infinite universe. 2.12. THE MERITS OP UNBIASEDNESS AND CONSISTENCY 45 o, 6, and c, which give an estimate 7(3) » —0.30288, which is much worse. Even when the larger sample includes all the points of the smaller, as in the example just cited, it can give a worse estimate. This is so because the next point drawn, c, may be so atypical as to outweigh the previous typical points a and b. 2.12. 
The merits of unbiasedness and consistency

Are the properties of unbiasedness and consistency worth the fuss? Remember the fundamental fact that with limited sampling resources it is not possible to estimate $\gamma$ correctly every time, even when the estimating procedure is unbiased and consistent. Because of a small budget, our sample may be so small that $\hat\gamma$ has a large variance. Even if the sample is large, it may be an unlucky one, yielding an extremely wrong estimate. The mistake has happened, and it is no consolation to know that, if we had taken all possible samples of that size, we would have hit the correct estimate on the average.

The following complaint is a familiar one from the area of Uncertainty Economics: Some people advise me to behave always so as to maximize my expected utility; in other words, to make once-in-a-lifetime decisions as if I had an eternity to repeat the experiment. Well, if I get my head chopped off on the first (and necessarily final) try, what do I care about the theoretical average consequences of my decision?

Wherever a comparatively crucial outcome hinges on a single correct estimate, unbiasedness is not in itself a desirable property. Likewise, it is mockery to tell an unsuccessful econometrician that he could have improved his estimate if he had been willing to enlarge his sample indefinitely.

What, then, is the use of unbiasedness and consistency? In themselves they are of no use; they do help, however, in the design of samples and as rules for research strategy and communication among investigators. There is a body of statistical theory (not discussed in this work) which tells us how to redesign our sample in order to decrease bias and inconsistency to some tolerable level. For example, with infinite universe, if we have two parameters to estimate, the theory says that a sample must be larger than 100 if consistency is to become "effective at the 5 per cent level."
Whether we want to take a sample that large depends on the use and strategic importance of our estimate as well as on the cost of sampling. All this opens up the fields of verification and statistical decision, into which we shall not go here.

Unbiasedness, consistency, and other estimating criteria to be introduced below are sometimes conceived of as scientific conventions:¹

If content to look at the procedure of point estimation unpretentiously as a social undertaking, we may therefore state our criterion of preference for a method of agreement so conceived in the following terms: (i) different observers make at different times observations of one and the same thing by one and the same method; (ii) individual sets of observations so conceived are independent samples of possible observations consistent with a framework of competence, and as such we may tentatively conceptualise the performance of successive sets as a stochastic process; (iii) we shall then prefer any method of combining constituents of observations, if it is such as to ensure a higher probability of agreement between successive sets, as the size of the sample enlarges in accordance with the assumption that we should thereby reach the true value of the unknown quantity in the limit; (iv) for a given sample size, we shall also prefer a method of combination which guarantees minimum dispersion of values obtainable by different observers within the framework of (i) above.

In the long run, the convention last stated guarantees that there will be a minimum of disagreement between the observations of different observers, if they all pursue the same rule consistently. . . . We have undertaken to operate within a fixed framework of repetition. This is an assumption which is intelligible in the domain of surveying, of astronomy or of experimental physics.
How far it is meaningful in the domain of biology and whether it is ever meaningful in the domain of the social sciences are questions which we cannot lightly dismiss by the emotive appeal of the success or usefulness of statistical methods in the observatory, in the physical laboratory and in the Cartographer's office.

Philosophers of probability are still debating whether the italics of the quotation do in fact define a universe of sampling, whether it can be defined apart from the postulate that an Urn of Nature underlies everything, and whether the above scientific conventions become reasonable only upon our conceding the postulate.

¹ Lancelot Hogben, Statistical Theory, pp. 1106-207 (London: George Allen & Unwin, Ltd., 1957). Italics added.

2.13. Other estimating criteria

So far I have mentioned three estimating criteria, or properties that we might desire our estimating procedures to have. These were (1) maximum likelihood, (2) unbiasedness, (3) consistency. Some others are:

4. Efficiency
If $\tilde\gamma$ and $\hat\gamma$ are two estimators from a sample of $S$ observations, the more efficient one has the smaller variance. It is possible to have $\sigma(\tilde\gamma,\tilde\gamma) < \sigma(\hat\gamma,\hat\gamma)$ for some sample sizes and the reverse for other sample sizes; or one may be uniformly more efficient than the other; some estimators are most efficient, others uniformly most efficient.

5. Sufficiency
An estimator from a sample of size $S$ is sufficient if no other estimator from the same sample can add any knowledge about the parameter being estimated. For instance, to estimate the population mean, the sample mean is sufficient and the sample median is not.

6. The following desirable property has no name. Let $\sigma(\hat\gamma,\hat\gamma \mid S)$ shrink more rapidly than $\sigma(\tilde\gamma,\tilde\gamma \mid S)$ as the sample increases. Then $\hat\gamma$ is more desirable than $\tilde\gamma$.

There is no end to the criteria one might invent. Nor are the criteria listed mutually exclusive.
Indeed, a maximum likelihood estimator tends to the normal distribution as the sample increases; it is consistent and most efficient for large samples. A maximum likelihood estimator from a single-peaked, symmetrically distributed universe is unbiased.

2.14. Least squares and the criteria

If all the Simplifying Assumptions are satisfied, the least squares method of estimating $\alpha$ and $\gamma$ in single-equation models of the form

$C_t = \alpha + \gamma Z_t + u_t$    (2-13)

yields maximum likelihood, unbiased, consistent, efficient, and sufficient estimates of the parameters. This result can be generalized in a variety of directions. The first generalization is that it applies to a model of the form

$y(t) = \alpha + \gamma_1 z_1(t) + \gamma_2 z_2(t) + \cdots + \gamma_H z_H(t) + u(t)$    (2-14)

where $y$ is the endogenous variable and the $z$'s are exogenous variables. (Least squares is biased if some of the $z$'s are lagged values of $y$; this question is postponed to the next chapter.)

Least squares yields maximum likelihood, unbiased, consistent, sufficient, but inefficient estimates if the variance of $u_t$ is not constant but varies systematically, either with time or with the magnitude of the exogenous variables. Such systematic variation of its variance makes $u$ heteroskedastic.

2.15. Treatment of heteroskedasticity

We shall confine the discussion of heteroskedasticity to model (2-13) on the understanding that it generalizes to (2-14). The random term can have a variable variance $\sigma_{uu}(t)$ for various reasons:

1. People learn, and so their errors of behavior become absolutely smaller with time. In this case $\sigma(t)$ decreases.
2. Income grows, and people now barely discern dollars whereas previously they discerned dimes. Here $\sigma(t)$ grows.
3. As income grows, errors of measurement also grow, because now the tax returns, etc., from which $C$ and $Z$ are measured no longer report pennies. Here $\sigma(t)$ increases.
4. Data-collecting techniques improve. $\sigma(t)$ decreases.

Consider Fig. 5.
It shows a sample of three points coming from a heteroskedastic universe. Since the errors are heteroskedastic, we would, on the average, expect observations in range 1 to fall rather near the true regression line, observations in range 2 somewhat farther, and in range 3 farther still. In any given sample, say, (a,b,c), points b and c should ideally be "discounted" according to the greater variances that prevail in their ranges. Using the straight sum of squares is the same as failing to discount b and c. The result is that sample (a,b,c) gives a larger value for $\hat\gamma$ than it would if observations had been properly discounted.

If no allowance is made for the changing variance $\sigma(t)$, least squares fits are maximum likelihood, unbiased, and consistent but inefficient. To show inefficiency, consider the likelihood function of (2-4). There, the matrix of the covariances of the random term not only was diagonal but had equal entries; so it could be factored out and dropped when the likelihood function was maximized with respect to $\gamma$ (and $\alpha$). It is this fact that made $\hat\gamma$ an efficient estimate. With unequal entries along the diagonal, this is no longer possible. To obtain an efficient, unbiased, and consistent estimate of $\gamma$, we must solve a complicated set of equations involving $\gamma, \sigma(1), \ldots, \sigma(S)$. Somewhat less efficient (but more so than minimizing $\sum u^2$) is to make (from outside knowledge) approximate guesses about $\sigma(1), \ldots, \sigma(S)$ and to minimize the sum of squares of appropriately "deflated" residuals (see Exercise 2.G). This, too, is an unbiased and consistent estimate.

Fig. 5. A typical sample from a heteroskedastic universe. (The figure marks Range 1, Range 2, and Range 3.)

Exercises

2.F Prove that $\hat\gamma = m_{zc}/m_{zz}$ is unbiased and consistent even when $u$ is heteroskedastic.

2.G Let $\phi(s)$ be an estimate (from outside information) of $1/\sigma_{uu}(s)$.
Prove that minimizing $\sum \phi(s) u^2(s)$ yields the following estimate of $\gamma$:

$\hat\gamma(\phi) = \dfrac{(\sum\phi)(\sum\phi CZ) - (\sum\phi Z)(\sum\phi C)}{(\sum\phi)(\sum\phi Z^2) - (\sum\phi Z)^2}$

2.H Prove the unbiasedness and consistency of $\hat\gamma(\phi)$.

Digression on arbitrary weights

The weights $\phi(s)$ are arbitrary. Is there no danger that the denominator of $\hat\gamma(\phi)$ might be (nearly or exactly) zero and blow up the proceedings? Answer: There is none. Proof:

$(\sum\phi)(\sum\phi Z^2) - (\sum\phi Z)^2 = \sum_{i<j} \phi_i \phi_j (Z_i - Z_j)^2 > 0$

It is perfectly proper to deflate the heteroskedastic residuals by the exogenous variable $Z$ itself and to fit by least squares the homoskedastic equation

$\dfrac{C}{Z} = \alpha \dfrac{1}{Z} + \gamma + \dfrac{u}{Z}$    (2-15)

instead of the original heteroskedastic one

$C = \alpha + \gamma Z + u$    (2-16)

From (2-15) and (2-16) we obtain numerically different consistent and unbiased estimates of $\alpha$ and $\gamma$.

Exercise

2.I Prove that $\hat\alpha(Z) = m_{(c/z)(1/z)}/m_{(1/z)(1/z)}$ is unbiased and consistent.

Further readings

Maurice G. Kendall, "On the Method of Maximum Likelihood" (Journal of the Royal Statistical Society, vol. 103, pt. 3, pp. 389-399, 1940) discusses the reasonableness of the method and the concept of likelihood. Whether the principle of maximum likelihood is logically wound up with subjective belief or inverse probability is still under debate. The intrepid reader who leafs through the last 30 or so years of the above Journal will be rewarded with the spectacle of a battle of nimble giants: Bartlett, Fisher, Gini, Jeffreys, Kendall, Keynes, Pearson, Yule.

The algebra of moments is a special application of matrices and vectors. Matrices and determinants are explained in the appendixes of Klein and Tintner. Allen devotes two chapters (12 and 13) to all the vector, matrix, and determinant theory an economist is ever likely to need.

The estimating criteria of unbiasedness, consistency, etc., are clearly stated and briefly discussed in the first dozen pages of Kendall's second volume, and debunked by Hogben in the reference cited in the text.
The reason for using $m_{\hat u \hat u}/S m_{zz}$ as an estimate of $\sigma(\hat\gamma,\hat\gamma)$, the formula for estimating $\mathrm{cov}(\hat\alpha,\hat\gamma)$ and $\sigma(\hat\alpha,\hat\alpha)$, and the extensions of these formulas for several y variables are stated and rationalized (in my opinion, not too convincingly) by Klein, pp. 133-137.

CHAPTER 3
Bias in models of decay

3.1. Introduction and summary

This chapter is tedious and not crucial; it can be skipped without great loss. I wrote it for two reasons: to develop the concept of conjugate samples, and to show what I have claimed in the Preface: that common-sense interpretations of intricate theorems in mathematical statistics can be found.

The main proposition of this chapter is that a single-equation model of the form

$y_t = y(t) = \alpha + \gamma_1 z_1(t) + \cdots + \gamma_H z_H(t) + u(t)$    (3-1)

in which some of the $z$'s are not exogenous variables but rather lagged values of $y$ itself, necessarily violates Simplifying Assumption 6, and hence that maximum likelihood estimates of $\alpha, \gamma_1, \ldots, \gamma_H$ are biased.

The concept of conjugate samples gives a handy and simple-minded but entirely rigorous way to test for bias. It will be used again and again in later chapters for models much more complicated than (3-1).

Equations involving lags of an endogenous variable are called autoregressive. Most satisfactory dynamic econometric models are multivariate autoregressive systems, in other words, elaborate versions of (3-1), and share its pitfalls in estimation.

We shall see that the character of initial conditions affects vitally our estimating procedure and that, unfortunately, in econometrics the initial conditions are not favorable to estimation, though in the experimental sciences they commonly are.

If the initial condition $y(0)$ is a fixed constant $Y$, the maximum likelihood criterion leads to least squares regression of $y(t)$ on $y(t-1)$, and the resulting estimate for $\gamma$ is biased, except for samples of size 1.
If $y(0)$ is a random variable, independent of $u$, then the maximum likelihood criterion does not lead to least squares. If least squares are used in this instance, they lead to biased estimates, again with the exception of samples of size 1.

CONVENTION
The size $S$ of the sample is given in units that correspond to the number of points through which a line is fitted. Thus, if we observe only $y_3$ and $y_2$, this is a sample of one; $S = 1$. If we observe $y_4$, $y_3$, and $y_2$, this makes a sample of two points, $S = 2$, and so on. In his proof of this theorem, Hurwicz (in Koopmans, chap. 15) would call these, respectively, samples of size $T = 2$ and $T = 3$. The difference is important when observations have gaps (are not consecutive). We shall confine ourselves to consecutive samples. Appendix D deals with the general case.

3.2. Violation of Simplifying Assumption 6

A lagged variable, unlike an exogenous variable, cannot be independent of the random component of the model. In (3-1) a lagged value of $y$ is necessarily correlated with some past value of $u$, because $y(t)$ and $u(t)$ are clearly correlated. Therefore, the very specification of (3-1) rules out Simplifying Assumption 6.

But why worry about such models? Because (3-1) and its generalizations express in linear form oscillations, decay, and explosions, which are all of great interest and which are, indeed, the bread and butter of physics, astronomy, and economics. For instance, springs behave substantially like

$y(t) = \alpha + \gamma_1 y(t-1) + \gamma_2 y(t-2) + u(t)$

and radioactive decay and pendulums like

$y(t) = \gamma y(t-1) + u(t)$    (3-2)

Business cycles are more complicated, involving several equations like (3-1).

Why do we want unbiased estimates? There are excellent reasons. If the world responds to our actions with some delay or if we respond with delay to the world, in order to act correctly we need to know the parameters accurately.
How hot the water in the shower is now depends on how far I had turned the tap some seconds ago. If my estimate of the parameter expressing the response of water temperature to a turn of the tap is biased, this means that I freeze or get scalded or that I alternate between these two states, and, in any event, that I reach a comfortable temperature much later than I would with an unbiased estimate.

In economics, consumers, businesses, and governments act like a man in a shower. The information they get about prices, sales, orders, or national income comes with some delay and reflects the water temperature at the tap some time ago. Moreover, it takes time to decide and to put decisions into effect. If the decision makers have misjudged how strong are the natural damping properties of the economy, decisions and policy will either overshoot or undershoot the mark, or alternate between overshooting and undershooting it, and will cause uncomfortable and unnecessary oscillations in economic activity.

Our discussion will now be confined to the simplest possible case (3-2). Let consumption this year $y(t)$ depend on consumption last year $y(t-1)$, as in (3-2). If the relationship involved a constant term $\alpha$, we eliminate $\alpha$ by measuring $y$ not from the origin but from its equilibrium value. I shall illustrate my argument by a concrete example where the true $\gamma$ has the convenient value 0.5 and where the initial value $Y$ is fixed and equal to 24. In Fig. 6, line OP represents the exact relationship $y_t = 0.5\,y_{t-1}$.

3.3. Conjugate samples

In model (3-1) with fixed initial conditions, we can describe a sample completely by mentioning two things: (1) what time periods it includes and (2) what values the disturbances took on in those periods. For example, $(a,b,c,d)$ in Fig. 6 is completely described by

$\begin{bmatrix} s = 1, 2, 3, 4 \\ u_s = 4, 0, 0, 0 \end{bmatrix}$

$(a',b',c',d')$ is described by

$\begin{bmatrix} s = 1, 2, 3, 4 \\ u_s = -4, 0, 0, 0 \end{bmatrix}$

and $(a',b',d',e')$ by

$\begin{bmatrix} s = 1, 2, 4, 5 \\ u_s = -4, 0, 0, 0 \end{bmatrix}$

If $u$ is symmetrically distributed, all conceivable samples of size $S$ that one can draw from the universe can be arranged in conjugate sets. We shall see that in each conjugate set the maximum likelihood estimates of $\gamma$ average to less than the true value of $\gamma$ and, therefore, that maximum likelihood estimates are biased for all samples of size $S$. These propositions need to be qualified if $S = 1$ or if $\gamma$ is not between 0 and 1; they are proved if $u(t)$ is normally distributed, but only conjectured if $u(t)$ has some other symmetrical distribution.

Fig. 6. Conjugate disturbances. (Horizontal axis: $y_{t-1}$.)

For an introduction to the concept of conjugate samples, consider Fig. 6, which depicts two of the many possible courses that events can follow under our assumptions that $\gamma = 0.5$ and $Y = 24$. One course is represented by the points $a, b, c, d, e, \ldots$; the other by $a', b', c', d', \ldots$. In the first course, the disturbance is equal to $+4$ in period 1 and zero thereafter. In the second course, it is $-4$ in period 1 and zero thereafter. The samples $S(+) = (a,b,c,d)$ and $S(-) = (a',b',c',d')$ are conjugate samples, and form a conjugate set. Similarly, $(a,b,c)$ and $(a',b',c')$ form a conjugate set.

To be conjugate, two samples must be drawn from the same time span $s = 1, 2, \ldots, S$; and the disturbances $u_s$ that contributed to corresponding observations must have the same absolute value in the two samples. This definition is for consecutive samples only. Appendix D extends it to the nonconsecutive case.

Thus, a sample whose disturbances are all zero forms a conjugate set all by itself. Sample

$\begin{bmatrix} s = 3, 4, 5, 6 \\ u_s = 0, 0, 17, 0 \end{bmatrix}$

has as its conjugate

$\begin{bmatrix} s = 3, 4, 5, 6 \\ u_s = 0, 0, -17, 0 \end{bmatrix}$

Sample

$\begin{bmatrix} s = 4, 5, 6, 7 \\ u_s = 0, 1, 0, -9 \end{bmatrix}$

has three conjugates, the following:

$\begin{bmatrix} s = 4, 5, 6, 7 \\ u_s = 0, -1, 0, -9 \end{bmatrix}$, $\begin{bmatrix} s = 4, 5, 6, 7 \\ u_s = 0, 1, 0, 9 \end{bmatrix}$, $\begin{bmatrix} s = 4, 5, 6, 7 \\ u_s = 0, -1, 0, 9 \end{bmatrix}$

A conjugate set of samples of size $S$ contains $2^k$ samples, where $k$ ($0 \le k \le S$) represents the number of nonzero disturbances. If $S = 4$, the largest conjugate set contains 16 samples.

3.4. Source of bias

In Fig. 6, line OR with slope $\hat\gamma[S(+)] = 0.6053$ is the least squares regression through the origin for sample $S(+) = (a,b,c,d)$; and OR' with slope $\hat\gamma[S(-)] = 0.3545$ is the same for the conjugate $(a',b',c',d')$. The line OR overestimates $\gamma$ because OR is pulled up by point $a$. The line OR' underestimates $\gamma$ because of the downward pull of point $a'$. As we have $\tfrac{1}{2}(0.6053 + 0.3545) = 0.4799 < \gamma$, the downward pull is the stronger. But why? Because point $a$ is accompanied by $b, c, d$, and $a'$ by $b', c', d'$. The primed points $b', c', d'$ are closer to the origin than the corresponding unprimed points; hence, their "leverage" on their least squares line OR' is weaker than the leverage of the unprimed points on theirs (line OR). It is impossible for $a'$ to be accompanied by $b, c, d$, because all future periods must necessarily inherit whatever impulse was first imparted by the random term of period 1. Points $b', c', d'$ inherit a negative impulse, and points $b, c, d$ inherit a positive one.

Another way of stating this is by referring to (3-1). In (3-1) one of the $z$'s (say, $z_4$) is a lagged value of $y$ (say, the lag is 2 time periods). It follows that $z_4(t)$ is correlated with the past value of the disturbance $u(t-2)$, since $y(t)$ is clearly correlated with $u(t)$.

All the proofs of bias later in this chapter and in Appendix D are merely fancy versions of what I have just shown for this special case.

When conjugate sets are large, arguments from the geometry of Fig. 6, though perfectly possible, become confusing, and so we turn to algebra.

With fixed initial condition $y(0) = Y$, the maximum likelihood estimate of $\gamma$ is the least squares estimate

$\hat\gamma = \dfrac{\sum_{t=1}^{S} y_t\, y_{t-1}}{\sum_{t=1}^{S} y_{t-1}^2}$    (3-3)

3.5.
Extent of the bias

From (3-3) and (3-2),

$\hat\gamma = \gamma + \dfrac{\sum u_t\, y_{t-1}}{\sum y_{t-1}^2}$    (3-4)

We write the above fraction $N/D$. We shall see that the bias $N/D$ varies with the true value of $\gamma$, the size of the sample, and the size of the initial value $Y$. For instance, in small samples it is almost 25 per cent; in samples of 20 observations, it is about 10 per cent of the true value of $\gamma$. It never disappears, no matter what value true $\gamma$ may have or how large a sample one takes.

By applying (3-2) repeatedly and letting $P$, $Q$, and $R$ stand for polynomials, we get

$N = (u_1 + \gamma u_2 + \gamma^2 u_3 + \cdots + \gamma^{S-1} u_S)\,Y + P(u_1, \ldots, u_S)$

$D = (1 + \gamma^2 + \gamma^4 + \cdots + \gamma^{2(S-1)})\,Y^2 + Y\,Q(\gamma, u_1, \ldots, u_{S-1}) + R(u_1, \ldots, u_{S-1})$

By considering $N/D$, one can establish that the bias is aggravated the further $\gamma$ is from $+1$ or $-1$ and the smaller the sample. Bias exists even when $\gamma = \pm 1$ or when $\gamma = 0$; the latter is truly remarkable, since the model is then reduced to $y(t) = u(t)$. Since $N$ is a linear function of $Y$ and the always positive denominator $D$ is a quadratic function of $Y$, the bias $N/D$ can be quite large for certain ranges of $Y$.

The above results generalize to model (3-1), although it is not easy to say whether the bias is up or down.

3.6. The nature of initial conditions

The following fantastically artificial example illustrates the concept of conjugate samples and what it means for initial conditions $y(0)$ to be random or fixed.

An outfit that runs automatic cafeterias has its customers use, instead of coins, special tokens made of copper. The company has several cafeterias across the country, but its customers rarely think of taking their leftover tokens with them when they travel or move from city to city. As there is at most one cafeteria per city, each cafeteria's tokens are like independent, closed monetary systems. Let us look at a single cafeteria of this kind.
Originally it had coined a number of brand-new tokens and put them in circulation, using $y(0)$ pounds of copper. Thereafter, the amount of copper in the tokens is subject to two influences. (1) To begin with, the tokens wear out as they are used. The velocity of token circulation is equal in all cities, and customers' pockets, hands, keys, and other objects that rub against the tokens are equally abrasive in all cities. Thus, in each city, year $t$ inherits only a part $\gamma$ ($0 < \gamma < 1$) of the copper circulating in the previous year. (2) In addition to the systematic factor of wear and tear, random influences are at play. First, some customer's child now and then swallows a token; this disappears utterly from circulation into the city's sewers. However, occasionally there is an opposite tendency. An amateur but successful counterfeiter mints his own token now and then, or a lost token is found inside a fish and put back into circulation. So the copper remaining in circulation is described by the stochastic model (3-2).

The problem for the company is how to estimate the true survival rate of its tokens.

It is very important to interpret correctly our first assumption that "$u(t)$ is a random variable in each time period $t$." It means that $u(t)$ is capable of assuming at least two values (opposites, if $u$ is symmetrical) in the same period of time. But how can it? Here we need a concept of conjugate cities analogous to conjugate samples.

Imagine that the only positive disturbances come from one counterfeiter and that the only negative disturbances come from one child, the counterfeiter's child, who swallows tokens. The counterfeiter is divorced, the child was awarded to the mother, and the two parents always live in separate cities, say, Ames and Buffalo; but who lives where in year $t$ is decided at random. Ames and Buffalo are conjugate cities, because, when one experiences counterfeiting, $+u(t)$, the other necessarily experiences swallowing, $-u(t)$.
If there were more families like this one, the set of conjugate cities would have to expand enough to accommodate all permutations of the various values that $\pm u(t)$ is capable of assuming.

We have fixed initial conditions if each cafeteria starts with the same poundage, and random initial conditions when the initial poundage is a random variable. To estimate the token survival rate, different procedures should be used in the two cases.

3.7. Unbiased estimation

Unbiased estimation of $\gamma$ is possible only if the initial copper endowment is a fixed constant $Y$. The only unbiased estimate is given by the ratio of the first two successive $y$'s using data from a single city:

$\hat\gamma = \dfrac{y(1)}{y(0)} = \dfrac{y(1)}{Y}$    (3-5)

which is a degenerate least squares estimate. This result is really startling. It says that we must throw out any information we may have about copper supply anywhere, except in year 0 and year 1 in, say, Ames. Unless we do this we can never hope to get an unbiased estimate.

Estimating $\gamma$ without bias when each city starts with a different amount of copper is an impossible task. A complete census of copper in all cities ($i$) in two successive years would give the correct (not just unbiased) estimate

$\gamma = \dfrac{\sum_i y_i(t)}{\sum_i y_i(t-1)}$

We can draw another fascinating conclusion: If we have the bad luck to start off with different endowments, we can never get an unbiased estimate of $\gamma$. But suppose we find that the endowments of all cities happen to be equal later, say, in period $t-1$. Then all we have to do is wait for the next year, measure the copper of any one city, say, Buffalo, and compute the ratio

$\hat\gamma = \dfrac{y_t}{y_{t-1}}$    (3-6)

which is an unbiased estimate. (Where the information would come from, that all cities have an equal token supply in year $t-1$, is another matter.)

The experimental scientist is, however, free from such predicaments.
If he thinks radium decays as in (3-2), then he can make initial conditions equal by putting aside in several boxes equal lumps of radium. Then he can let them decay for a year, remeasure them, apply (3-6) to the contents of any one box, and average the results. Any one box gives an unbiased estimate. Averaging the contents of several boxes gives an estimate that is efficient as well as unbiased.

The econometrician cannot control his initial conditions in this way. If he wants an unbiased estimate, he must throw away, as prescribed, most of his information, use formula (3-6), and thus get an unbiased and inefficient estimate. Or else he may decide that he wants to reduce the variance of the estimate at the cost of introducing some bias; then he will use a formula like that of Exercise 3.C below or some more complicated version of it.

Autoregressive equations are related to the moving average, a technique commonly employed to interpolate data, to estimate trends, and to isolate cyclical components of time series. The statistical pitfalls of estimating (3-2) plague time series analysis, and they are not the only pitfalls. The last chapter of this book returns to some of these problems.

Exercises

3.A Prove that (3-5) is unbiased.

3.B Prove that $\hat\gamma = y(2)/y(1)$ is biased.

3.C Prove that $\hat\gamma = [y(2) + y(1)]/[y(1) + Y]$ is biased.

3.D Let $u_t$ in (3-2) have the symmetrical distribution $q(u_t)$ with finite variance. "Symmetrical" means $q(u_t) = q(-u_t)$. Then the likelihood function of a random consecutive sample is $\prod_t q(u_t)$. Prove that the maximum likelihood estimate of $\gamma$ is obtained by minimizing the expression $-\sum_t \log q(\hat u_t)$, where the $\hat u_t$ are the vertical deviations from the line that we are seeking.

3.E By the method of conjugate samples or by any other method, prove or disprove the conjecture that the estimate of Exercise 3.D is biased.
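The conjugate-sample arithmetic of Secs. 3.3 and 3.4 can be checked numerically. What follows is a minimal sketch in Python, not part of the original text: the function names (`path`, `ls_through_origin`, `conjugate_set_mean`, `monte_carlo_mean`) are hypothetical, and the Monte Carlo step assumes normal disturbances with a standard deviation of 4, a value chosen only for illustration. It reproduces the two slopes quoted for Fig. 6 and averages the estimate (3-3) over a whole conjugate set.

```python
import itertools
import random


def path(gamma, y0, disturbances):
    """Return [y(0), y(1), ..., y(S)] generated by y(t) = gamma*y(t-1) + u(t)."""
    ys = [y0]
    for u in disturbances:
        ys.append(gamma * ys[-1] + u)
    return ys


def ls_through_origin(ys):
    """Least squares slope of y(t) on y(t-1) through the origin, as in (3-3)."""
    num = sum(ys[t] * ys[t - 1] for t in range(1, len(ys)))
    den = sum(ys[t - 1] ** 2 for t in range(1, len(ys)))
    return num / den


def conjugate_set_mean(gamma, y0, abs_u):
    """Average the estimate over every sign pattern of the nonzero disturbances."""
    nonzero = [i for i, u in enumerate(abs_u) if u != 0]
    estimates = []
    for signs in itertools.product((1, -1), repeat=len(nonzero)):
        u = list(abs_u)
        for i, sign in zip(nonzero, signs):
            u[i] = sign * abs_u[i]
        estimates.append(ls_through_origin(path(gamma, y0, u)))
    return sum(estimates) / len(estimates)


def monte_carlo_mean(gamma, y0, size, sd, reps, seed=0):
    """Mean estimate over many samples with normal disturbances (assumed sd)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        u = [rng.gauss(0.0, sd) for _ in range(size)]
        total += ls_through_origin(path(gamma, y0, u))
    return total / reps


GAMMA, Y0 = 0.5, 24.0  # the values used in the text's example
g_plus = ls_through_origin(path(GAMMA, Y0, [4, 0, 0, 0]))    # sample S(+)
g_minus = ls_through_origin(path(GAMMA, Y0, [-4, 0, 0, 0]))  # sample S(-)
print(round(g_plus, 4), round(g_minus, 4))                    # 0.6053 0.3545
print(round(conjugate_set_mean(GAMMA, Y0, [4, 0, 0, 0]), 4))  # 0.4799
print(monte_carlo_mean(GAMMA, Y0, size=4, sd=4.0, reps=20000))
```

The first two printed lines agree with the slopes of OR and OR' and with their average. The last line is a simulated mean and depends on the assumed disturbance distribution and seed, so it is offered only as an illustration of the downward bias, not as a proof.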
Further readings

The reader who wants to see for himself how intricate is the statistical theory of even the simplest possible lagged model (3-2) may look up "Least-Squares Bias in Time Series," by Leonid Hurwicz, chap. 15 of Koopmans, pp. 365-383. Tintner, pp. 255-260, gives examples and shows additional complications.

CHAPTER 4
Pitfalls of simultaneous interdependence

4.1. Simultaneous interdependence

"Everything depends on everything else" is the theme song of the Economic and the Celestial Spheres. It means that several contemporaneous endogenous variables hang from one another by means of several distinct causal strings. Thus, there are two causal (more politely, functional) relations between aggregate consumption and aggregate income: Since people are one another's customers, consumption causes income, and, since people work to eat, income causes consumption. The two relationships are, respectively, the national income identity in its simplest form

$y_t = c_t$    (4-1)

and the (unlagged) stochastic consumption function in its simplest form

$c_t = \alpha + \beta y_t + u_t$    (4-2)

We can imagine that causal forces flow from the right to the left of the two equality signs.

The moral of this chapter is that, if endogenous variables, like $c$ and $y$, are connected in several ways, like (4-1) and (4-2), every statistical procedure that ignores even one of the ways is bound to be wrong. The statistical procedure must reflect the economic interdependence.

4.2. Exogenous variables

I shall not vouch for the Heavens, but in economics there are such things as exogenous variables. A variable exogenous to the economic sphere is a variable, like an earthquake, that influences some economic variables, like rents and food prices, without being influenced back. The random term $u$ is, ideally, exogenous, though in practice it is a catchall for all unknown or unspecified influences, exogenous or endogenous. One thing is certain: Earthquakes and such are not influenced by disturbances in consumption. Indeed, the definition of an exogenous variable is that it has no connection with the random component of an economic relationship.

My prototype exogenous variable, investment $z$, is not really exogenous to the economic system, especially in the long run, but we shall bow to tradition and convenience for the sake of exposition.

4.3. Haavelmo's proposition

The models in this chapter, like the single-equation models treated so far, (1) are linear and (2) have all the Simplifying Properties. Therefore, they are subject to all the pitfalls I have pointed out so far.

Unlike the models of Chaps. 1 to 3, the new models each contain at least two equations. Most of my examples will have precisely two (and not three or four) for convenience only, since the results can easily be extended.

New kinds of complication arise when a second equation is added.

1. The identification problem
It is sometimes impossible to estimate the parameters; this problem is side-stepped until Chap. 6.

2. The Haavelmo¹ problem
The intuitively obvious way of estimating the parameters of a two-equation model is wrong, even in the simplest of cases, where one of the equations is an identity.

We shall see that pedestrian methods are unable to estimate correctly the marginal propensity to consume out of current income, no matter how many years of income and consumption data we may have. Even infinite samples overestimate the marginal propensity to consume. This difficulty is as strategic as it sounds incredible. It means that the multiplier gets overestimated and, hence, that counterdepression policies will undershoot full employment and counterinflation policies will be too timid. Because of bad statistical procedures, the cure of unemployment or inflation comes too slowly.
The model is as follows:

$$c_t = \alpha + \beta y_t + u_t \quad \text{(consumption function)} \tag{4-2}$$

$$c_t + z_t = y_t \quad \text{(income identity)} \tag{4-3}$$

where $z_t$ (investment) is exogenous, and $u_t$ has all the Simplifying Properties.² We shall illustrate by assuming the convenient values $\alpha = 5$, $\beta = 0.5$.

In Fig. 7, line FG represents the true relation $c_t = 5 + 0.5y_t$. When the random disturbance is positive, the line moves up; with negative disturbance, it moves down. Lines HJ and KL correspond, respectively, to random errors equal to $+2$ and $-2$. OQ, the 45° line through the origin, represents equation (4-3) for the special case in which investment $z$ is zero. In the years when investment is zero, the only combinations of income and consumption we could possibly observe will have to lie on OQ, because nowhere else can there be equilibrium. If, for instance, in years 1900 and 1917 investment had been zero and if the errors had been $+2$ and $-2$, respectively, then points P and P' would have been observed.

Let us now suppose that in some years investment $z_t$ equals 3. Line MN (also 45° steep) describes the situation, which is that $c_t + 3 = y_t$. With errors $u_t = \pm 2$, the observable points are at R and R'. With errors ranging from $-2$ to $+2$, all observable points fall between R and R'.

Fig. 7. The Haavelmo bias.

¹ For reference to Haavelmo, see Further Readings at the end of this chapter.

² To be specific, Assumption 6 in this case requires that $u$ and $z$ shall not influence each other, either in the same time period or with a lag. But the random term $u$ cannot be independent of $y$. The reason is that $\alpha$ and $\beta$ are constants, $z$ is fixed outside the economic sphere, and $u$ comes, so to speak, from a table of random numbers; if this is so, then, by equations (4-2) and (4-3), $\alpha$, $\beta$, $z$, and $u$ necessarily determine $y$ (and $c$). Thus variable $y$ is not predetermined but codetermined with $c$. These statements summarize and anticipate the remainder of the chapter.
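The numerical setup just described ($\alpha = 5$, $\beta = 0.5$, exogenous investment, bounded disturbances) is easy to simulate. The following is only a minimal sketch in Python with NumPy: the uniform distributions, the sample size, and the function names `simulate_keynesian` and `moment` are assumptions made for illustration, not anything prescribed by the text. It regresses $c$ on $y$ in the naive way and, for contrast, forms the moment ratio $m_{cz}/m_{yz}$ that Sec. 4.4 and Exercise 4.F arrive at.

```python
import numpy as np

def simulate_keynesian(alpha=5.0, beta=0.5, n=20000, seed=0):
    """Draw (c, y, z) from c = alpha + beta*y + u together with c + z = y."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 3.0, size=n)    # exogenous investment (assumed range)
    u = rng.uniform(-2.0, 2.0, size=n)   # disturbance, drawn independently of z
    y = (alpha + z + u) / (1.0 - beta)   # solve the two equations for income
    c = y - z                            # income identity
    return c, y, z

def moment(a, b):
    """Sample moment of a and b about their means."""
    return np.mean((a - a.mean()) * (b - b.mean()))

c, y, z = simulate_keynesian()
naive_slope = moment(c, y) / moment(y, y)      # least squares of c on y
indirect_slope = moment(c, z) / moment(y, z)   # uses the exogenous variable z
print("naive slope:", round(naive_slope, 3))
print("slope from m_cz / m_yz:", round(indirect_slope, 3))
```

With these assumed distributions the naive slope should come out well above the true 0.5 however large `n` is made, while the second ratio should settle near 0.5.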
Let us now pass a least-squares regression line through a scatter diagram of income and consumption, minimizing squares in the vertical sense and arguing that from the point of view of the consumption function income causes consumption, not vice versa. Such a procedure is bound to overestimate the slope $\beta$ of the consumption function and to underestimate its intercept $\alpha$. This is Haavelmo's proposition.

The least squares line (in dashes) corresponds to observation points that lie in the area PP'R'R. It is tilted counterclockwise relative to the true line FG because of the pull of "extreme" points in the corners next to R and P'. The less investment $z$ ranges and the bigger the stochastic errors $u$ are, the stronger is the counterclockwise pull, because lines PP' and RR' fall closer together.

This overestimating of $\beta$ persists even if we allow investment to range very far. Though it is true that the parallelogram PP'R'R gets longer and longer toward the northeast (say, it becomes PVV'P') the fact remains that V and P', the extreme corners, help to tilt the least squares line upward. This suggests that perhaps we ought to minimize squares not in a vertical direction but in a direction running from southwest to northeast. In this particular case (though not generally) diagonal least squares are precisely correct and equivalent to the procedure of simultaneous estimation described in the following section.

4.4. Simultaneous estimation

We know that two relations, not one, account for the slanted position of the universe points in Fig. 7. Had the consumption function been at work alone, a given income change $\Delta y$ would result in a change in consumption $\Delta c = \beta\,\Delta y$. Had the income identity been at work alone, then to the same change in income would correspond a larger change in consumption $\Delta c = \Delta y$. In fact, both relations are at work.
Therefore, the total manifest response of consumption to income is neither $\Delta c = \beta\,\Delta y$ nor $\Delta c = \Delta y$, but something in between. This is why the line in dashes is steeper than FG (and less steep than OQ).

In order to isolate the $\beta$ effect from a sample of points like PP'R'R, both relations must be allowed for. This is done by rewriting the model:

$$c_t = \frac{\alpha}{1-\beta} + \frac{\beta}{1-\beta}\,z_t + \frac{u_t}{1-\beta} \tag{4-4}$$

$$y_t = \frac{\alpha}{1-\beta} + \frac{1}{1-\beta}\,z_t + \frac{u_t}{1-\beta} \tag{4-5}$$

The term $u_t/(1-\beta)$ has the same properties as $u_t$ except that it has a different variance. Therefore the error term in the new model has all the Simplifying Properties. Either of the new equations constitutes a single-equation model with one endogenous variable ($c$ and $y$, respectively) and one independent variable ($z$ in both cases). Therefore, the estimating techniques of Sec. 2.5 can be applied to the sophisticated parameters $\alpha' = \alpha/(1-\beta)$, $\gamma_1 = \beta/(1-\beta)$, $\gamma_2 = 1/(1-\beta)$. Denote these estimates by the hat ($\hat{\ }$). For the naive least squares estimate of $\alpha$ and $\beta$, derived from regressing $c$ on $y$, use the bird ($\check{\ }$).

Let us now express these estimates in terms of moments, and let us do it for $\beta$, $\gamma_1$, and $\gamma_2$ only, leaving aside $\alpha$ and $\alpha'$.

$$\hat\gamma_1 = \frac{m_{cz}}{m_{zz}} \qquad \hat\gamma_2 = \frac{m_{yz}}{m_{zz}} \qquad \check\beta = \frac{m_{cy}}{m_{yy}}$$

$\check\beta$ is a biased estimate of $\beta$, because

$$\check\beta = \frac{m_{cy}}{m_{yy}} = \frac{m_{(\alpha+\beta y+u)y}}{m_{yy}} = \beta + \frac{m_{uy}}{m_{yy}}$$

and it is known that

$$\mathcal{E}\left(\frac{m_{uy}}{m_{yy}}\right) \neq 0 \tag{4-6}$$

$\check\beta$ is inconsistent, because

$$\check\beta = \frac{m_{(\alpha+\beta z+u)(\alpha+z+u)}}{m_{(\alpha+z+u)(\alpha+z+u)}} = \frac{\beta + (1+\beta)(m_{uz}/m_{zz}) + m_{uu}/m_{zz}}{1 + 2m_{uz}/m_{zz} + m_{uu}/m_{zz}}$$

The various moments $m_{cy}$, $m_{uu}$, etc., vary in value, of course, from sample to sample. As the sample size approaches the population size, however, $m_{uz}$ approaches $\operatorname{cov}(u,z) = 0$, $m_{uu}$ approaches $\operatorname{var} u > 0$, and $m_{zz}$ approaches $\operatorname{var} z > 0$. Therefore,

$$\operatorname{Plim} \check\beta = \frac{\beta + \operatorname{var} u/\operatorname{var} z}{1 + \operatorname{var} u/\operatorname{var} z}$$

Exercises

4.A In similar fashion prove that $\operatorname{Plim} \check\alpha < \alpha$.

4.B Interpret (4-6).

4.C Show that $1/(1 - \check\beta)$ is a biased estimate of $1/(1 - \beta)$.
Hint: manipulate the expression

$$\frac{1}{1-\check\beta} = \frac{1}{1 - m_{cy}/m_{yy}}$$

and use the fact that $\mathcal{E}(m_{yz}/m_{zz}) = 1/(1-\beta)$.

4.D Prove that $\hat\gamma_1$ is an unbiased and consistent estimate of $\beta/(1-\beta)$.

4.E Prove that $\hat\gamma_2$ is an unbiased and consistent estimate of $1/(1-\beta)$.

4.F Prove that $\hat\gamma_1$ and $\hat\gamma_2$ yield a single compatible estimate of $\beta$, which we call $\hat\beta$; $\hat\beta = m_{cz}/m_{yz}$.

4.G Prove that $\hat\beta$ is a biased but consistent estimate of $\beta$.

4.H From the facts that $\check\beta = \beta + m_{uy}/m_{yy}$ and that $\hat\beta = \beta + m_{uz}/m_{yz}$, argue that the bias of $\hat\beta$ is less serious than the bias of $\check\beta$.

Digression on directional least squares

What do we get if, in Fig. 7, we minimize the sum of the square deviations not vertically but from the southwest to the northeast? Let $P(y,c)$ in Fig. 8 stand for any point of the sample; $p = PZ$ is parallel to the 45° line.

Fig. 8. Directional least squares.

Let $\theta$ be the angle of inclination of the true consumption function; that is, let $\tan\theta$ be the slope of the line $c = \alpha + \beta y$. Then in triangle PZM, from the law of sines, we have

$$\frac{p_t}{\sin(90^\circ + \theta)} = \frac{u_t}{\sin\phi}$$

where $\phi = 45^\circ - \theta$ is the angle at Z, from which it follows that

$$p_t = \frac{\sqrt{2}}{1-\beta}\,u_t = \frac{\sqrt{2}}{1-\beta}\,(c_t - \alpha - \beta y_t)$$

Then,

$$\sum p_t^2 = \frac{2}{(1-\beta)^2}\sum u_t^2 = \frac{2}{(1-\beta)^2}\,m_{(c-\alpha-\beta y)(c-\alpha-\beta y)}$$

Setting $c = y - z$,

$$\sum p_t^2 = \frac{2}{(\beta-1)^2}\left[m_{zz} + (\beta-1)^2 m_{yy} + 2(\beta-1)m_{yz}\right]$$

Minimizing $\sum p_t^2$ with respect to $\beta - 1$, we obtain

$$\frac{1}{1-\beta} = \frac{m_{yz}}{m_{zz}}$$

that is to say, the same expression that we found for $\hat\gamma_2$.

4.5. Generalization of the results

Section 4.3 showed the pitfalls of ignoring the income identity in estimating the consumption function, and Sec. 4.4 showed how to get around this difficulty by the technique of simultaneous estimation, which takes into account the entire model even though the investigator may be interested in only a part. Chapters 5 to 9 deal with the intricacies of simultaneous estimation and various approximations thereof. To prepare the way, let us enlarge the model slightly, by making investment respond to income.
The new model is

$$c_t = \alpha + \beta y_t + u_t \tag{4-2}$$

$$i_t = z_t + \gamma + \delta y_t + v_t \tag{4-7}$$

$$c_t + i_t = y_t \tag{4-8}$$

where $z_t$ is autonomous investment; $i_t$ is total investment; and $u_t$, $v_t$ are random disturbances independent of each other and of present and past values of $z$. The last sentence is a statement of Simplifying Assumption 7, which will be explained and justified in the next chapter.

Letting $s_t = y_t - c_t$ stand for saving, we obtain from (4-2) the saving function

$$s_t = -\alpha + (1-\beta)y_t - u_t \tag{4-9}$$

Figure 9 shows saving SS and investment II as functions of income, with zero disturbances (thick lines) and disturbed by $\pm u$, $\pm v$, respectively (thin lines), and with the usual (stable) relative slopes.

Naive least squares applied to Fig. 9 underestimates the slope $1 - \beta$ of SS (as it underestimates the slope of OQ in Fig. 7) and, hence, again overestimates the marginal propensity to consume. The Haavelmo bias again.

4.6. Bias in the secular consumption function

We have shown that naive curve fitting overestimates the slope of the consumption function, even with large samples and whether or not investment is a function of income. Statistical fits of the secular consumption function give a slope varying from over 0.95 to nearly 1.0, contradicting the lower figures given by budget studies, introspection, and Keynes's hunch. To reconcile these facts, consumption theories of imitation, irreversible behavior, and more and more explanatory variables have been invoked. A large part of what these ingenious theories account for can be explained by Haavelmo's proposition.

Further readings

Trygve Haavelmo's proposition was, apparently, stated first in "The Statistical Implications of a System of Simultaneous Equations" (Econometrica, vol. 11, no. 1, pp. 1-12, January, 1943), but a later article of his applying the proposition to the consumption function has attracted far more attention.
This has appeared in three places: Trygve Haavelmo, "Methods of Measuring the Marginal Propensity to Consume" (Journal of the American Statistical Association, vol. 42, no. 237, pp. 105-122, March, 1947); reprinted as Cowles Commission Paper 22, new series; and again as chap. 4 of Hood, pp. 75-91. Haavelmo gives numerical results and confidence intervals for the parameter estimates.

Jean Bronfenbrenner, "Sources and Size of Least-squares Bias in a Two-equation Model," chap. 9 of Hood, pp. 221-235, extends Haavelmo's proposition to three more special cases.

An early article by Lawrence R. Klein, "A Post-mortem on Transition Predictions of National Product" (Journal of Political Economy, vol. 54, no. 4, pp. 289-308, August, 1946), puts the Haavelmo proposition in proper perspective, as indicating only one of the many sources of malestimation.

Milton Friedman, A Theory of the Consumption Function (New York: National Bureau of Economic Research, 1957), also compares and discusses rival measurements of consumption, but his main concern is to test the Permanent Income hypothesis and to refine the consumption functions, not to discuss econometric pitfalls. It contains valuable references to the literature of the consumption function.

According to Guy H. Orcutt, "Measurement of Price Elasticities in International Trade" (Review of Economics and Statistics, vol. 32, no. 2, pp. 117-132, May, 1950), Haavelmo's proposition explains why exchange devaluation had been underrated as a cure to balance-of-payments difficulties. Orcutt confines mathematics to appendixes and gives many further references.

Tjalling C. Koopmans in "Statistical Estimation of Simultaneous Economic Relations" (Journal of the American Statistical Association, vol. 40, no. 232, pt. 1, pp. 448-466, December, 1945), discusses the Haavelmo proposition with the help of a supply-and-demand example and with interesting historical comments.
When the random disturbances are viewed not as errors of observation clinging to specific variables but as errors of the econometric relationship itself, then they affect all simultaneous endogenous variables symmetrically, and Haavelmo's problem rears its head. The Koopmans article is a good preview of the next chapter.

CHAPTER 5

Many-equation linear models

5.1. Outline of the chapter

The moral of Chap. 4 is this: if a model has two equations they cannot be estimated one at a time, each without regard for the other, because both take part together in generating the phenomena from which we draw samples. This fact rules out, except in special cases, the use of the pedestrian technique of naive least squares. Both the moral and the reason behind it remain in force as the number of equations in the model increases.

The present chapter is rather unimportant, and might be skipped or skimmed at first. All its principles are implicit in Chap. 4.

The main task of Chap. 5 is to systematize the study of many-equation linear models. First we present some standard and effort-saving notation (Sec. 5.2). Next, we review the Simplifying Assumptions, which were originally introduced for one-equation models in Chap. 1, to see precisely how they extend to the general case (Sec. 5.3). With two or more equations, a seventh Simplifying Assumption is required, that of stochastic independence among the equations (Sec. 5.4).

The presence of several simultaneous equations in a model complicates the likelihood function with the term $\det J$, which we have ignored until now; in intricate fashion $\det J$ involves the parameters of all equations in the system. (The last proposition merely restates the moral of Chap. 4.) The digression on Jacobians explains what $\det J$ is doing in the likelihood function.

If we heed the moral to the letter and take $\det J$ into account, we get into awfully long computations (see Sec.
5.5) in spite of all our original Simplifying Assumptions.

Whether computations are long or short, it pays to lay them out in an orderly way. This is a general precept, of course, but its value stands out most dramatically in the present chapter. It pays not only to do computations in an orderly manner but also to perform some redundant ones just in case you might want to check some alternative. Econometricians normally settle down to a specific model only after much experimentation. And, further, redundant computations become necessary when we want to estimate a given promising model by increasingly refined techniques. The wisdom of performing the redundant computations will become fully apparent only after we have dealt with overidentified systems, instrumental variables, limited information, and Theil's method (Chaps. 6 to 9).

5.2. Effort-saving notation

It pays to establish once and for all a uniform notation for complete linear models of several equations. These are conventions, not assumptions.

The endogenous variables are denoted by $y$'s. There are $G$ endogenous variables, called $y_1, y_2, \ldots, y_G$ and, collectively, $\mathbf{y}$. $\mathbf{y}$ is the vector $(y_1, y_2, \ldots, y_G)$. We use $g$ ($g = 1, 2, \ldots, G$) as running subscript for endogenous variables.

The exogenous variables are denoted by $z$'s. There are $H$ exogenous variables, called $z_1, z_2, \ldots, z_H$, and $\mathbf{z}$ is their vector. These may be lagged values of the $y$'s only by special mention. The running subscript of an exogenous variable is $h = 1, 2, \ldots, H$.

All definitions have been solved out of the system, so that there are exactly $G$ equations, all stochastic, with errors $u_1, u_2, \ldots, u_G$. $\mathbf{u} = (u_1, u_2, \ldots, u_G)$.

We speak of the $g$th equation.

The coefficients of the $y$'s are called $\beta$s, and those of the $z$'s are called $\gamma$s. They bear two subscripts: the first refers to the equation, the second to the variable to which the parameter corresponds.
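We get rid of the constant term (if any) by letting the last exogenous variable $z_H$ be identically equal to 1; its parameter $\gamma_H$ then becomes the constant term. In most applications we shall not bother to write the constant term at all. Either it is in the last term $\gamma_H z_H$, where $z_H = 1$, or it has been eliminated by measuring all variables from their means.

$B$ and $\Gamma$ represent the matrices of coefficients in their natural order:

$$B = \begin{bmatrix} \beta_{11} & \beta_{12} & \cdots & \beta_{1G} \\ \beta_{21} & \beta_{22} & \cdots & \beta_{2G} \\ \vdots & \vdots & & \vdots \\ \beta_{G1} & \beta_{G2} & \cdots & \beta_{GG} \end{bmatrix} \qquad \Gamma = \begin{bmatrix} \gamma_{11} & \gamma_{12} & \cdots & \gamma_{1H} \\ \gamma_{21} & \gamma_{22} & \cdots & \gamma_{2H} \\ \vdots & \vdots & & \vdots \\ \gamma_{G1} & \gamma_{G2} & \cdots & \gamma_{GH} \end{bmatrix}$$

$B$ is always square and of size $G \times G$; $\Gamma$ is of size $G \times H$. $A$ stands for the elements of $B$ and $\Gamma$ set side by side:

$$A = [B \;\; \Gamma]$$

that is to say, for the matrix of all coefficients in the model, whether they belong to endogenous or exogenous variables. $A$ is of size $G \times (G + H)$.

$\mathbf{x}$ stands for the elements of $\mathbf{y}$ and $\mathbf{z}$ set side by side.

$$\mathbf{x} = (y_1, y_2, \ldots, y_G;\; z_1, z_2, \ldots, z_H)$$

that is to say, $\mathbf{x}$ is the vector of all variables, whether endogenous or exogenous, but in their natural order.

$\boldsymbol\alpha_1$ stands for the first row of $A$, $\boldsymbol\alpha_2$ for the second row, etc.; similarly for $\boldsymbol\beta_1, \boldsymbol\beta_2, \ldots, \boldsymbol\beta_G$, $\boldsymbol\gamma_1, \boldsymbol\gamma_2, \ldots, \boldsymbol\gamma_G$. That is, a lower-case bold Greek letter with a single subscript $g$ represents (some of) the parameters of a single equation (the $g$th) of the system.

We reduce the number of parameters to be estimated by dividing each equation by one of its coefficients. This does not affect the model in any other way. We use the $g$th coefficient of the $g$th equation for this, so that $\beta_{gg} = 1$. Henceforth we shall always take matrix $B$ in its "standardized form"

$$B = \begin{bmatrix} 1 & \beta_{12} & \cdots & \beta_{1G} \\ \beta_{21} & 1 & \cdots & \beta_{2G} \\ \vdots & \vdots & & \vdots \\ \beta_{G1} & \beta_{G2} & \cdots & 1 \end{bmatrix}$$

A model can be written in a variety of forms:

1. Explicitly, as below (time subscripts omitted):

$$\begin{aligned} y_1 + \beta_{12}y_2 + \cdots + \beta_{1G}y_G + \gamma_{11}z_1 + \gamma_{12}z_2 + \cdots + \gamma_{1H}z_H &= u_1 \\ \beta_{21}y_1 + y_2 + \cdots + \beta_{2G}y_G + \gamma_{21}z_1 + \gamma_{22}z_2 + \cdots + \gamma_{2H}z_H &= u_2 \\ \cdots\cdots\cdots \\ \beta_{G1}y_1 + \beta_{G2}y_2 + \cdots + y_G + \gamma_{G1}z_1 + \gamma_{G2}z_2 + \cdots + \gamma_{GH}z_H &= u_G \end{aligned} \tag{5-1}$$

2. In extended vector form:

$$\begin{aligned} \boldsymbol\beta_1\mathbf{y} + \boldsymbol\gamma_1\mathbf{z} &= u_1 \\ \boldsymbol\beta_2\mathbf{y} + \boldsymbol\gamma_2\mathbf{z} &= u_2 \\ \cdots\cdots\cdots \\ \boldsymbol\beta_G\mathbf{y} + \boldsymbol\gamma_G\mathbf{z} &= u_G \end{aligned} \tag{5-2}$$

3.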
In condensed vector form:

$$\begin{aligned} \boldsymbol\alpha_1\mathbf{x} &= u_1 \\ \boldsymbol\alpha_2\mathbf{x} &= u_2 \\ \cdots\cdots \\ \boldsymbol\alpha_G\mathbf{x} &= u_G \end{aligned} \tag{5-3}$$

4. In extended matrix form:

$$B\mathbf{y} + \Gamma\mathbf{z} = \mathbf{u} \tag{5-4}$$

5. In condensed matrix form:

$$A\mathbf{x} = \mathbf{u} \tag{5-5}$$

Note that, when the context is clear, bold lower-case letters stand either for a row or for a column vector.

Finally, $\sigma_{gh}(t)$ stands for the covariance of $u_g(t)$ with $u_h(t)$; $[\sigma_{gh}(t)]$ is the matrix of these covariances; and $[\sigma^{gh}(t)]$ is its inverse.

5.3. The Six Simplifying Assumptions generalized

A laconic mathematician can generalize the Six Simplifying Assumptions with a stroke of brevity by saying that they continue to apply if we replace the symbol $u$ by $\mathbf{u}$. Our task is to interpret this in terms of economics.

In Chap. 1, I discussed the Six Simplifying Properties when there was a single equation in the model and, therefore, a single disturbance. Now we have one disturbance for each equation, and $\mathbf{u}$ is the vector made up of them, $\mathbf{u}(t) = (u_1(t), u_2(t), \ldots, u_G(t))$.

Assumption 1

"$\mathbf{u}$ is a random variable" means that each $u_g(t)$ is a random variable, that is to say, that all equations remaining after solving out the definitions are stochastic.

Assumption 2

"$\mathbf{u}$ has expected value $\mathbf{0}$" means that the mean of the joint distribution is the vector $\mathbf{0} = (0, 0, \ldots, 0)$, or that each $u_g$ has zero expected value.

Assumption 3

"$\mathbf{u}$ has constant variance" means that the covariances $\sigma_{gh} = \operatorname{cov}(u_g, u_h)$ of the several disturbances do not vary with time.

Assumption 4

"$\mathbf{u}$ is normal" means that $u_1(t), u_2(t), \ldots, u_G(t)$ are jointly normally distributed.

Assumption 5

"$\mathbf{u}$ is not autocorrelated" means that there is no correlation between the disturbance of one equation and previous values of itself.

Assumption 6

"$\mathbf{u}$ is not correlated with $\mathbf{z}$" means that no exogenous variable, in whichever equation it appears, is correlated with any disturbance, past, present, or future, of any equation in the model.
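One way to get a feel for the notation and for what the generalized assumptions demand is to manufacture a sample that obeys them. The sketch below is only an illustration in Python with NumPy: the coefficient values, the covariance matrix `Sigma`, and the sample size are hypothetical, and the normal draws for the exogenous variables are a convenience. Disturbances are drawn jointly normal with zero mean and a fixed covariance matrix, independently over time and independently of $\mathbf{z}$; the endogenous variables are then solved from $B\mathbf{y} + \Gamma\mathbf{z} = \mathbf{u}$, and the condensed form $A\mathbf{x} = \mathbf{u}$ is checked observation by observation.

```python
import numpy as np

rng = np.random.default_rng(1)
S, G, H = 500, 2, 3                     # observations, equations, exogenous variables

# Hypothetical coefficients; B is in standardized form (ones on the diagonal).
B = np.array([[1.0, -0.5],
              [0.3,  1.0]])             # G x G
Gamma = np.array([[-1.0, 0.0, -5.0],
                  [ 0.0, 2.0, -1.0]])   # G x H; the last column multiplies z_H = 1
A = np.hstack([B, Gamma])               # G x (G + H)

# Assumptions 2 to 5: zero mean, constant covariance, normality, no autocorrelation.
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
U = rng.multivariate_normal(np.zeros(G), Sigma, size=S)   # S x G

# Assumption 6: the exogenous variables are drawn independently of U.
Z = np.column_stack([rng.normal(size=S), rng.normal(size=S), np.ones(S)])

# Solve B y + Gamma z = u for y, one observation per row.
Y = np.linalg.solve(B, (U - Z @ Gamma.T).T).T             # S x G
X = np.hstack([Y, Z])                                     # S x (G + H)

residual = X @ A.T                      # row s is A x(s)
print(np.allclose(residual, U))
```

Each row of `residual` reproduces the disturbance vector that generated that observation, which is all that forms (5-1) to (5-5) assert.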
On these assumptions, the likelihood function of the sample is

$$L = (2\pi)^{-GS/2}(\det J)^S(\det[\sigma_{gh}])^{-S/2}\exp\left\{-\tfrac{1}{2}\sum_{s=1}^{S}\mathbf{u}_s'[\sigma^{gh}]\mathbf{u}_s\right\} \tag{5-6}$$

which should be compared with (2-2). The analogy is perfect. The expression in the curly braces can also be written

$$-\tfrac{1}{2}\sum_{s}\sum_{g}\sum_{h}u_g(s)\,\sigma^{gh}\,u_h(s) \tag{5-7}$$

Another way to write the likelihood function is

$$L = (2\pi)^{-GS/2}(\det J)^S(\det[\sigma_{gh}])^{-S/2}\exp\left\{-\tfrac{1}{2}\sum_{s}\mathbf{x}(s)'A'[\sigma^{gh}]A\mathbf{x}(s)\right\} \tag{5-8}$$

which brings out the fact that $L$ is a function (1) of all the unknown parameters $\beta_{gh}$, $\gamma_{gh}$, $\sigma_{gh}$ and (2) of all the observations $\mathbf{x}(s)$ ($s = 1, 2, \ldots, S$). The function's logarithmic form

$$S^{-1}\log L = -\tfrac{G}{2}\log 2\pi + \log\det J - \tfrac{1}{2}\log\det[\sigma_{gh}] - \frac{1}{2S}\sum_{s}\mathbf{x}(s)'A'[\sigma^{gh}]A\mathbf{x}(s) \tag{5-9}$$

is easier to use.

5.4. Stochastic independence

The seventh Simplifying Assumption, that $[\sigma_{gh}]$ is a diagonal matrix, or

$$\operatorname{cov}(u_g, u_h) = 0 \quad \text{for } g \neq h$$

is not obligatory, but it is easy to rationalize. It states that the disturbance of one equation is not correlated with the disturbance in any other equation of the model in the same time period, which is something quite different from Assumption 6.

Recall that each random term is a gathering of errors of measurement, errors of aggregation, omitted variables, omitted equations, and errors of linear approximation. Assumption 7 states that either (1) the $g$th equation and the $h$th equation are disturbed by different random causes, or (2) if they are disturbed by the same causes, different "drawings" go into $u_g(t)$ and $u_h(t)$.

This assumption is clearly inapplicable in the following situations:

1. In year $t$, all or nearly all statistics were subject to larger than the usual errors, because of a cut in the budget of the Statistics Bureau.

2. Errors of aggregation affect mainly national income (because of shifts in distribution), and national income enters several equations of the model.

3. Omitted variables (one or more) are known to affect two (or more) equations.
For instance, weather affects the supply of watermelons, cotton, and whale blubber. Now if the model contains equations for watermelons and blubber, the inclusion of weather in the random term does not hurt, because relatively independent drawings of weather (one in the Southeast, one in the South Pacific) affect these two industries. However, if watermelons and cotton are included in the model, both of these are grown in the same belt, the weather affecting them is one and the same, and Assumption 7 is violated.

Assumption 7 simplifies the computations (1) because it leaves fewer covariances to estimate, (2) because $\det[\sigma_{gh}]$ becomes a simple product $\prod_g \sigma_{gg}$, and (3) because all the cross terms¹ in (5-7) drop out. This can reduce computations by a factor of 2 or 3 for a model of as few as three equations and by a much greater factor for larger systems.

¹ Those for which $g \neq h$.

Digression on Jacobians

The likelihood function involves a term expressed as $\det J$, the Jacobian of the functions $u$, say, with respect to the variables $y$; we have disregarded $\det J$ until now, since we have taken it on faith to be equal to 1. This is no longer true in a many-equation model. Here $J$ is a matrix of unknown parameters, the same $\beta$s, in fact, that we are trying to estimate with the likelihood function.

The main ideas behind $J$ are three:

1. If you know the probability distribution of a variable $u$ (or several variables $u_1, u_2, \ldots, u_G$), then you can find the probability distribution of a variable $y$ related to $u$ functionally (or of several $y$'s related functionally to the $u$'s).

2. If the $u$'s and $y$'s are equally numerous and if the functions connecting the two sets are one-to-one, continuous, and with continuous first derivatives, then the matrix $J$ of all partial derivatives of the form $\partial u/\partial y$ will have an inverse.

3.
If conditions 1 and 2 are satisfied, then we can calculate the joint probability distribution $q$ of the $y$'s from the known joint probability distribution $p$ of the $u$'s (omitting the subscript $t$) as follows:

$$p(u_1, u_2, \ldots, u_G)\,du_1\,du_2 \cdots du_G = \det J \cdot p(u_1, u_2, \ldots, u_G)\,dy_1\,dy_2 \cdots dy_G \tag{5-10}$$

or

$$q(y_1, y_2, \ldots, y_G)\,dy_1\,dy_2 \cdots dy_G = \det J \cdot p(u_1, u_2, \ldots, u_G)\,dy_1\,dy_2 \cdots dy_G$$

I shall illustrate these three ideas by examples.

Example 1

Let $u$ be a single variable whose probability distribution we know to be as follows:

Value of u            -4     -3      1      3
Probability p(u)      0.1    0.2    0.4    0.3

Let $y$ be related functionally to $u$ as follows:

$$y(u) = u^2 - 4u + 3 \tag{5-11}$$

As $u$ takes on its four values, $y$ takes on the corresponding values $y(-4) = 35$, $y(-3) = 24$, $y(1) = 0$, $y(3) = 0$. Since we know how often $u$ is equal to $-4$, $-3$, 1, and 3, we can find how often $y$ is equal to 35, 24, and 0.

Value of y            35     24      0
Probability q(y)      0.1    0.2    0.7

Example 2

The same can be done with several $y$'s and $u$'s connected by an appropriate set of functions, for instance,

$$\begin{aligned} y_1 &= -u_1 + 3u_2 - u_3 \\ y_2 &= e^{-u_1} + \log u_2 \end{aligned}$$

provided the probability distribution $p(u_1, u_2, u_3)$ is known.

Relation (5-11) is not one-to-one, since, for every value of $y$, $u$ can have two values. Accordingly, in Example 1 the second condition is violated, and the Jacobian is undefined. The same is true for Example 2.

Whenever the functional relation between the $u$'s and $y$'s is one-to-one, $[\partial u/\partial y]$ and $[\partial y/\partial u]$ are single-valued and their determinants multiply up to the number 1.

Example 3

$$y(u) = 3u + \log u - 4$$

Though it is very hard to express $u$ in terms of $y$, we know that, since $dy/du = 3 + 1/u = (3u + 1)/u$, the Jacobian $J = du/dy = u/(3u + 1)$.

Example 4

$$\begin{aligned} y_1 &= -u_1 + u_2 \\ y_2 &= e^{-u_1} + \log u_2 + 5 \end{aligned} \tag{5-12}$$

Here we can compute $\det J$ from knowledge of

$$\det\left[\frac{\partial y}{\partial u}\right] = \det\begin{bmatrix} -1 & 1 \\ -e^{-u_1} & 1/u_2 \end{bmatrix} = -\frac{1}{u_2} + e^{-u_1}$$

since it follows that $\det J = u_2/(u_2e^{-u_1} - 1)$.
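The arithmetic of Examples 1 and 4 can be checked mechanically. What follows is a small sketch in plain Python, offered only as an illustration: the evaluation point $(u_1, u_2) = (0.5, 2)$ and the step size of the finite differences are arbitrary choices, and the helper names `y_of_u` and `det_dy_du` are invented here. The first part pushes the distribution of $u$ through (5-11); the second compares the formula for $\det J$ in Example 4 with the reciprocal of a numerically differentiated $\det[\partial y/\partial u]$.

```python
import math
from collections import defaultdict

# Example 1: distribution of y = u^2 - 4u + 3 implied by the distribution of u.
p_u = {-4: 0.1, -3: 0.2, 1: 0.4, 3: 0.3}
q_y = defaultdict(float)
for u, prob in p_u.items():
    q_y[u * u - 4 * u + 3] += prob     # values of u mapping to the same y pool their probability
print(dict(q_y))

# Example 4: y1 = -u1 + u2, y2 = exp(-u1) + log(u2) + 5.
def y_of_u(u1, u2):
    return (-u1 + u2, math.exp(-u1) + math.log(u2) + 5.0)

def det_dy_du(u1, u2, h=1e-6):
    """Central-difference determinant of the 2 x 2 matrix of partials dy/du."""
    d11 = (y_of_u(u1 + h, u2)[0] - y_of_u(u1 - h, u2)[0]) / (2 * h)
    d12 = (y_of_u(u1, u2 + h)[0] - y_of_u(u1, u2 - h)[0]) / (2 * h)
    d21 = (y_of_u(u1 + h, u2)[1] - y_of_u(u1 - h, u2)[1]) / (2 * h)
    d22 = (y_of_u(u1, u2 + h)[1] - y_of_u(u1, u2 - h)[1]) / (2 * h)
    return d11 * d22 - d12 * d21

u1, u2 = 0.5, 2.0                                   # arbitrary evaluation point
det_J_numeric = 1.0 / det_dy_du(u1, u2)             # the two determinants multiply to 1
det_J_formula = u2 / (u2 * math.exp(-u1) - 1.0)     # expression given in Example 4
print(det_J_numeric, det_J_formula)
```

The two printed determinants should agree to several decimal places at any point where $u_2 > 0$ and $u_2e^{-u_1} \neq 1$.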
Therefore, by (5-10), the probability distribution of the $y$'s in Example 4 is

$$q(y_1, y_2)\,dy_1\,dy_2 = \frac{u_2}{u_2e^{-u_1} - 1}\,p(u_1, u_2)\,dy_1\,dy_2$$

Now, what relevance does all this have to econometrics? Very simple. Let $y_1, y_2, \ldots, y_G$ be endogenous variables, and let $u_1, u_2, \ldots, u_G$ be the random errors attached to the structural equations. The model's $G$ equations are explicit functional relations between the $y$'s and the $u$'s, like (5-12). Directly, we know nothing at all about the probability of this or that combination of $y$'s. Nevertheless, (5-10) allows us to compute this probability, namely, in terms of $J$ and the probability distribution of the $u$'s. It turns out that the right-hand side of (5-10) involves only the parameters we seek, the observations we can make, and the probability distribution $p$, which we have already specified when we constructed the model.

If the structural equations are all linear, as in (5-1), the matrix $J$ of all partial derivatives of the form $\partial u/\partial y$ turns out to be nothing but the matrix $B$ itself.

$$J = \begin{bmatrix} 1 & \beta_{12} & \cdots & \beta_{1G} \\ \beta_{21} & 1 & \cdots & \beta_{2G} \\ \vdots & \vdots & & \vdots \\ \beta_{G1} & \beta_{G2} & \cdots & 1 \end{bmatrix} = B$$

5.5. Interdependence of the estimates

Now that we know that $J = B$, we can both find the values $\beta$, $\gamma$, $\sigma$ that maximize the likelihood function (5-9) and compute its actual value. Actually we do not care how large $L$ itself is. Naturally, maximizing such a function by ordinary methods is a staggering job; we won't undertake it. In fact, nobody undertakes it by direct attack.

We shall use (5-9) to answer the following question: In order to estimate this particular parameter or this particular equation, do we need to estimate all parameters? The answer, generally, is yes.

Note first of all that the maximum likelihood method of estimating $B$, $\Gamma$, and $[\sigma_{gh}]$ differs from the naive least squares method quite radically, because the least squares method does not involve the term $\log\det B$ at all.
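To see the term $\log\det B$ at work, the logarithmic likelihood with $J = B$ can be evaluated directly. The function below is only a sketch in Python with NumPy under assumptions of my choosing: the name `mean_log_likelihood` and all coefficient values are hypothetical, the constant is written in the standard $G$-variate form $-\tfrac{G}{2}\log 2\pi$, and the absolute value of $\det B$ is used so that the logarithm is always defined.

```python
import numpy as np

def mean_log_likelihood(B, Gamma, Sigma, Y, Z):
    """Per-observation log likelihood of a linear model B y + Gamma z = u, with J = B."""
    S, G = Y.shape
    U = Y @ B.T + Z @ Gamma.T                        # row s holds the residuals of observation s
    quad = np.einsum("sg,gh,sh->", U, np.linalg.inv(Sigma), U) / S
    _, logdet_B = np.linalg.slogdet(B)               # log |det B|, the Jacobian term
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    return (-0.5 * G * np.log(2.0 * np.pi) + logdet_B
            - 0.5 * logdet_Sigma - 0.5 * quad)

# Hypothetical two-equation model and a sample generated from it.
rng = np.random.default_rng(2)
S = 200
B = np.array([[1.0, -0.5],
              [0.3,  1.0]])
Gamma = np.array([[-1.0, 0.0],
                  [ 0.0, 2.0]])
Sigma = np.diag([1.0, 0.5])                          # diagonal, as under Assumption 7
Z = rng.normal(size=(S, 2))
U = rng.multivariate_normal(np.zeros(2), Sigma, size=S)
Y = np.linalg.solve(B, (U - Z @ Gamma.T).T).T

full = mean_log_likelihood(B, Gamma, Sigma, Y, Z)
_, logdet_B = np.linalg.slogdet(B)
without_jacobian = full - logdet_B                   # the criterion with the B term left out
print(full, without_jacobian)
```

The two printed numbers differ by exactly $\log|\det B|$, a quantity that depends on the $\beta$s of every equation at once; a criterion that drops it therefore treats the equations as if they could be fitted separately.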
In other words, the least squares method, if applied to the model one equation at a time, omits from account the matrix $B$; it does not allow the parameters of one equation to influence the estimation of the parameters of another; nor does it allow the covariances $\sigma_{gh}$ to influence in the least the parameter estimates of any equation that is being fitted. Finally, the least squares technique estimates the covariances $\sigma_{gg}$ one at a time without involving any other covariance.

Contrariwise, in maximum likelihood, the estimates $\hat\beta$ of one equation affect the $\hat\beta$s and $\hat\gamma$s of another; the $\hat\sigma$s of one equation affect the $\hat\beta$s and $\hat\gamma$s of another; and one $\hat\sigma$ affects another.

In a word, the sophisticated maximum likelihood method is very expensive from the point of view of computations and is probably more refined than the quality of the raw statistical data warrants. Econometric theory is like an exquisitely balanced French recipe, spelling out precisely with how many turns to mix the sauce, how many carats of spice to add, and for how many milliseconds to bake the mixture at exactly 474 degrees of temperature. But when the statistical cook turns to raw materials, he finds that hearts of cactus fruit are unavailable, so he substitutes chunks of cantaloupe; where the recipe calls for vermicelli he uses shredded wheat; and he substitutes green garment dye for curry, ping-pong balls for turtle's eggs, and, for Chalifougnac vintage 1883, a can of turpentine.

Two courses of action are open to the econometrician who is reluctant to lavish refined computations on crude data:

1. Use the refined maximum likelihood method, but reduce the burden of computation by making additional Simplifying Assumptions.

2. Water down the maximum likelihood method to something more pedestrian but not quite so naive as least squares. Limited information, instrumental variable, and other techniques are available; they are the subject of Chaps. 7, 8, and 9.

5.6.
Recursive models

If $B$ is a triangular matrix,¹ the model is called recursive; and its computation is lightened, because there are fewer $\beta$s to estimate and because $\det B = 1$.

¹ $B$ is triangular if $\beta_{gh} = 0$ for all $g < h$.

The economic interpretation of a recursive model is the following. There is an economic variable in the system (say, the price of coffee beans) that is affected only by exogenous variables (like Brazilian weather); next, there is a second economic variable (say, the price of a cup of coffee) that is affected by exogenous variables (tax on coffee beans) and by the one endogenous variable (price of coffee beans) just mentioned. Next, there is a third economic variable (say, the number of hours spent by employees for coffee breaks) that depends only on exogenous variables (the amount of incoming gossip) and (one or both of) the first two endogenous variables but no others; and so on.

Exercises

5.A In the recursive system

$$\begin{aligned} y_1 &= \gamma z_1 + u \\ y_2 &= \beta y_1 + \gamma_1z_1 + \gamma_2z_2 + v \end{aligned}$$

let the Simplifying Properties hold for $u$, $v$ with respect to the exogenous variables. Prove that, if $\beta$ is estimated by naive least squares, that is, if

$$\check\beta = \frac{m_{(y_2,z_1,z_2)\cdot(y_1,z_1,z_2)}}{m_{(y_1,z_1,z_2)\cdot(y_1,z_1,z_2)}}$$

then $\check\beta$ is biased.

5.B In the recursive model

$$\begin{aligned} x_t &= \beta y_t + u_t \\ y_t &= \gamma x_{t-1} + v_t \end{aligned}$$

show that the estimates of $\beta$ and $\gamma$ are unbiased but that least squares applied to the autoregressive equation obtained as a combination of the two equations gives biased estimates.

Further readings

The notation of Sec. 5.2 is worth learning because it is becoming standard among econometricians. It is expanded in Koopmans, chap. 2.

Jacobians are illustrated by Klein, pp. 32-38. The mathematics of Jacobians, with proofs, can be found in Richard Courant, Differential and Integral Calculus, vol. 2, chap. 3 (New York: 1953), or in Wilfred Kaplan, Advanced Calculus, pp. 00-100 (Reading, Massachusetts: 1952).

Klein, p. 81, gives a simple example of a recursive model.

CHAPTER 6

Identification

6.1.
Introduction

Identification problems spring up almost everywhere in econometrics as soon as one departs from single-equation models. This chapter far from exhausts the subject. In particular, the next two topics, instrumental variables in Chap. 7 and limited information in Chap. 8, are intimately bound up with it. The identification problem will arise sporadically in later chapters.

Though this chapter is self-contained, some familiarity with the subject is desirable. I know of no better elementary treatment than that of Tjalling C. Koopmans, "Identification Problems in Economic Model Construction," chap. 2 in Hood. I have chosen to devote this chapter to a few topics which, in my opinion, either have not received convincing treatment or have not been put in pedagogic form.

The main results of this chapter are the following:

1. There are several definitions of identifiability. I show their equivalence.

2. Lack or presence of identification may be due (a) to the model's a priori specification, (b) to the actual values of its unknown parameters, or (c) to the particular sample we happen to have drawn.

3. There are ways to detect overidentification and underidentification. These ways are not always foolproof. There are several ways to remove over- or underidentification.

4. In spite of the superficial fact that they are defined in analogous terms, underidentification and overidentification are qualitatively different properties: the former is nonstochastic, the latter stochastic; the former can be removed (in special cases) by means of additional restrictions, the latter is handled by better observations or longer computation.
6.2. Completeness and nonsingularity

The following discussion applies to all kinds of models, linear or not, large or small, but it will be illustrated by this example:

$$\begin{aligned}
y_1 + \gamma_{11}z_1 + \gamma_{12}z_2 + \gamma_{13}z_3 + \gamma_{14}z_4 &= u_1 \\
\beta_{21}y_1 + y_2 + \beta_{23}y_3 + \gamma_{21}z_1 + \gamma_{22}z_2 + \gamma_{23}z_3 &= u_2 \\
\beta_{31}y_1 + y_3 + \gamma_{31}z_1 + \gamma_{32}z_2 &= u_3
\end{aligned} \tag{6-1}$$

This model describes an economic mechanism that works somewhat like this:

1. The parameters $\beta$ and $\gamma$ are fixed constants.
2. In each time period, someone supplies outside information about the exogenous variables $z$.
3. In each time period, someone goes to a preassigned table of random numbers, and, using a prescribed procedure, reads off some numbers $u_1$, $u_2$, $u_3$.
4. All this is fed into (6-1).
5. Values for the endogenous variables, $y_1$, $y_2$, $y_3$, are generated in accordance with the resulting system.

The last step succeeds if and only if the linear equations resulting from step 4 are independent. Otherwise there is an infinity of compatible triplets $(y_1, y_2, y_3)$. The model is complete if it can be solved uniquely for $(y_1, y_2, y_3)$; otherwise it is incomplete. To generate a unique triplet it is necessary and sufficient that the matrix $B$ be nonsingular, meaning that no row of it is a linear combination of other rows.

The economic interpretation of singularity and nonsingularity is very simple. Each equation in (6-1) represents the behavior of a sector of the economy, say, producers, consumers, bankers, buyers, sellers, or middlemen. These sectors respond to exogenous stimuli $z$ and economic stimuli $y$. They may respond to exogenous stimuli in any way whatsoever. In particular, it is quite permissible for them to respond in the same way to all exogenous stimuli ($\gamma_{11} = \gamma_{21} = \gamma_{31}$, $\gamma_{12} = \gamma_{22} = \gamma_{32}$, etc.). But, if the matrix is to be nonsingular, they should respond in different ways to the endogenous stimuli. No sector may have the same parameters as another; no sector's responses may be the average of two other sectors' responses.
No sector may be a weighted average of any other sectors, as far as economic stimuli are concerned. To illustrate singularity, consider a simple economy which consists of three families responding to three economic stimuli but such that the third family makes an average response. Then $B$ is singular, and the model containing the three families is incomplete. For nonsingularity the sectors must be sufficiently unlike each other. In fact this is the definition of sectors: that they are economically different from one another.

Exercise

6.A Prove the following theorems by using the common sense of the five steps of the above discussion: "If Assumption 7 is made, then $B$ is nonsingular," and "If $B$ is singular, Assumption 7 cannot hold." These two statements can be reworded: "An econometric model is complete if and only if its sectors are stochastically independent." Appendix E proves this mathematically, but what is wanted in this exercise is an "economic" proof.

6.3. The reduced form

Every complete linear model $By + \Gamma z = u$ can be reduced to $y = \Pi z + v$. These two expressions are called the original form and the reduced form. If it is complete, the original model (6-1) can be reduced to

$$\begin{aligned}
y_1 &= \pi_{11}z_1 + \pi_{12}z_2 + \pi_{13}z_3 + \pi_{14}z_4 + v_1 \\
y_2 &= \pi_{21}z_1 + \pi_{22}z_2 + \pi_{23}z_3 + \pi_{24}z_4 + v_2 \\
y_3 &= \pi_{31}z_1 + \pi_{32}z_2 + \pi_{33}z_3 + \pi_{34}z_4 + v_3
\end{aligned} \tag{6-2}$$

Some obvious properties of (6-2) are worth pointing out:

Its random disturbances $v_1$, $v_2$, $v_3$ are linear combinations of the original random disturbances and share their properties. However, the $v$'s have different covariances from the $u$'s. In particular, the $v$'s are interdependent even if the $u$'s were stochastically independent. (We seldom have to worry about the precise relation among the $u$'s and $v$'s.)

Unlike the typical original form, each equation of the reduced form contains all the exogenous variables of the model.
Each equation of the reduced form constitutes a model that satisfies the Six Simplifying Assumptions of Chap. 1 and, therefore, may validly be estimated by least squares; these estimates are called $\hat\pi$s. If it is possible to work back from the $\hat\pi$s to estimate unambiguously the coefficients $\beta$, $\gamma$ of the original form, we shall call such estimates $\hat\beta$, $\hat\gamma$ and say that (6-1) is exactly identified.

Finally, the coefficients of the two forms (6-1), (6-2) are connected as follows:

$$\begin{aligned}
-\gamma_{11} &= \pi_{11} & -\gamma_{21} &= \beta_{21}\pi_{11} + \pi_{21} + \beta_{23}\pi_{31} & -\gamma_{31} &= \beta_{31}\pi_{11} + \pi_{31} \\
-\gamma_{12} &= \pi_{12} & -\gamma_{22} &= \beta_{21}\pi_{12} + \pi_{22} + \beta_{23}\pi_{32} & -\gamma_{32} &= \beta_{31}\pi_{12} + \pi_{32} \\
-\gamma_{13} &= \pi_{13} & -\gamma_{23} &= \beta_{21}\pi_{13} + \pi_{23} + \beta_{23}\pi_{33} & 0 &= \beta_{31}\pi_{13} + \pi_{33} \\
-\gamma_{14} &= \pi_{14} & 0 &= \beta_{21}\pi_{14} + \pi_{24} + \beta_{23}\pi_{34} & 0 &= \beta_{31}\pi_{14} + \pi_{34}
\end{aligned} \tag{6-3}$$

It is possible, but messy, to solve for $\pi_{11}, \ldots, \pi_{34}$ in terms of the $\beta$s and $\gamma$s. The important fact is that, in general, all $\pi$s are a priori nonzero in the reduced form, even if many of the $\beta$s and $\gamma$s are a priori zero in the original form. Relations (6-3) can be written much more compactly:

$$-\Gamma = B\Pi \tag{6-4}$$

6.4. Over- and underdeterminacy

As a preview for the rest of this chapter, imagine that (6-1) is complete. If so, its reduced form (6-2) exists and can be estimated by least squares. Let the estimates be $\hat\pi_{11}, \ldots, \hat\pi_{34}$.

Now consider the leftmost column of equations in (6-3). Evidently the $\gamma$s can be computed right away from the $\pi$s, uniquely and unambiguously. We say, then, that $\gamma_{11}$, $\gamma_{12}$, $\gamma_{13}$, $\gamma_{14}$ are exactly identified.

Consider next the last two equations of (6-3); they give rise to two estimates of $\beta_{31}$, namely, $-\hat\pi_{33}/\hat\pi_{13}$ and $-\hat\pi_{34}/\hat\pi_{14}$, which in general are quite different, no matter how ideal the sample. When this happens to parameters, we say that they are (or the equation that contains them is) overidentified; accordingly, system (6-3) overdetermines $\beta_{31}$.

Consider now the middle column of (6-3). Its four equations underdetermine the five unknowns $\beta_{21}$, $\beta_{23}$, $\gamma_{21}$, $\gamma_{22}$, $\gamma_{23}$.
An equation to which such parameters belong is underidentified.

Obviously, then, the identification problem has something to do with the number of equations and unknowns in the system $-\Gamma = B\Pi$. The Counting Rules of Sec. 6.7 will show this more precisely.

6.5. Bogus structural equations

Consider the supply-demand model

$$\begin{aligned}
SS \text{ (Supply)}\quad & y_1 + \beta_{12}y_2 = u_1 \\
DD \text{ (Demand)}\quad & \beta_{21}y_1 + y_2 = u_2
\end{aligned} \tag{6-5}$$

where $y_1$ represents price and $y_2$ represents quantity; linear combinations of the true supply and demand are called bogus relations and are branded with the superscript $\oplus$. A bogus relation may parade either as supply or as demand,

$$SS^{\oplus} = j(SS) + k(DD)$$
$$DD^{\oplus} = m(SS) + n(DD)$$

where $j$, $k$, $m$, $n$ are unknown numbers, but suitable to make the standardized coefficients $\beta^{\oplus}_{11}$, $\beta^{\oplus}_{22}$ of the bogus relations equal to 1. The bogus coefficients are connected with the true coefficients as follows:

$$\begin{aligned}
\beta^{\oplus}_{11} &= j + k\beta_{21} = 1 & \beta^{\oplus}_{12} &= j\beta_{12} + k \\
\beta^{\oplus}_{21} &= m + n\beta_{21} & \beta^{\oplus}_{22} &= m\beta_{12} + n = 1
\end{aligned}$$

The bogus supply contains a random term $u^{\oplus}_1 = ju_1 + ku_2$, and the bogus demand contains an analogous term $u^{\oplus}_2 = mu_1 + nu_2$. Later on we shall use the following relations between the covariances of the bogus and the true disturbances:

$$\begin{aligned}
\operatorname{var} u^{\oplus}_1 &= j^2 \operatorname{var} u_1 + 2jk \operatorname{cov}(u_1,u_2) + k^2 \operatorname{var} u_2 \\
\operatorname{var} u^{\oplus}_2 &= m^2 \operatorname{var} u_1 + 2mn \operatorname{cov}(u_1,u_2) + n^2 \operatorname{var} u_2 \\
\operatorname{cov}(u^{\oplus}_1,u^{\oplus}_2) &= jm \operatorname{var} u_1 + (jn + mk) \operatorname{cov}(u_1,u_2) + kn \operatorname{var} u_2
\end{aligned} \tag{6-6}$$

6.6. Three definitions of exact identification

The discussion that follows is meant to apply to linear models only. Some results can be extended to other types of models (but not in this work).

A model or an equation in it may be either (exactly) identified, or underidentified, or overidentified. Setting aside for the moment the last two cases, here are three alternative definitions of exact identification: one in terms of the statistical appearance of the model, one in terms of maxima of the likelihood function $L$, and one in terms of the probability distribution of the endogenous variables.

Definition 1.
A model is identified if its structural equations "look different" from the statistical point of view. An equation looks different if linear combinations of the other equations in the system cannot produce an equation involving exactly the same variables as the equation in question. Thus the supply-demand model (6-5) is not exactly identified, because both equations contain the same variables, price and quantity. In the model

$$\begin{aligned}
SS\quad & y_1 + \beta_{12}y_2 + \gamma_{11}z_1 = u_1 \\
DD\quad & \beta_{21}y_1 + y_2 = u_2
\end{aligned} \tag{6-7}$$

where $z_1$ represents rainfall, a linear combination of SS and DD contains the same variables as SS itself. Not so for DD, because every nontrivial linear combination introduces rainfall into the demand equation. In this model the demand equation is identified, but the supply is not exactly identified. In such cases the model is not exactly identified.

Definition 2. A model is identified if the likelihood function $L(\hat A)$ has a unique maximum at a "point" $\hat A = A^0$. This means that, if you substitute the values $\alpha^0$ in $L$, $L$ is maximal; at any other point $L$ is definitely smaller. Similarly, an equation is exactly identified if the likelihood function $L$ becomes smaller when you replace the set $\alpha^0_g$ of that equation's parameters by any other set $\alpha_g$. This way of looking at the matter is presented in detail later on, in Sec. 6.12.

Definition 3. Anything (a model, an equation, a parameter) is called exactly identified if it can be determined from knowledge of the conditional distribution of the endogenous variables, given the exogenous. This is to say, it is identified if, given a sample that was large enough and rich enough, you could determine the parameters in question. We know that, no matter how large the sample or how rich, we could never disentangle the two equations of (6-5).

All three definitions appear to say that exact identification is not a stochastic property, for it does not seem to depend on the samples we may chance to draw.
We shall return to this question later on.

One must be very accurate and careful about the terminology. Over-, under-, and exact identification are exhaustive and mutually exclusive cases. Identified means "either exactly or overidentified." Not identified means "underidentified."

Underidentification occurs when:

By linear combinations of the equations one can obtain a bogus equation that looks statistically like some true equation (Definition 1).

The likelihood function has a maximum maximorum at two or more points of the parameter space (Definition 2).

Knowledge of the conditional distribution of the endogenous variables, given the exogenous, does not determine all the parameters of the model (Definition 3).

There are three principal ways to avert (or at least to detect) absence of exact identification: (1) constraints on the a priori values of the parameters; (2) constraints on the estimates of the parameters; (3) constraints on the stochastic assumptions of the model.

6.7. A priori constraints on the parameters

Two new symbols will speed up the discussion considerably. Suppose we are discussing the third equation of a model. A single asterisk will denote the variables present in the third equation, a double asterisk, those absent from the third equation. Asterisks can be attached to variables, to their parameters, or to vectors of such variables and parameters. The asterisk notation has now become standard in econometric literature, and Appendix F gives a detailed account of it.

The commonest a priori restrictions on $A$ are (1) zero restrictions, like $\gamma_{24} = 0$; (2) parameter equalities in the same equation, for example, $\gamma_{21} = \gamma_{22}$; (3) other a priori equations involving parameters of several equations. These cases have economic counterparts, which I proceed to illustrate.

Zero restrictions

Zero restrictions are common and handy.
A zero restriction says that, for all we know, such and such a variable is irrelevant to the behavior of a given sector. If nothing but zero restrictions are contemplated, then we have a handy counting rule (Counting Rule 1) for telling whether an equation is identified.

If an equation of a model contains all the variables of the model, it is underidentified, because linear combinations of all the equations look statistically just like it. To avoid this underidentification, the following two conditions are necessary:

1. That some variables (call them $x^{**}$) be absent from this equation.
2. That the variables (call them $x^{*}$) present in the equation in question, whenever they appear in another equation, be mixed with at least one $x^{**}$.

In (6-1) the first equation is identified, because any intermixture of the second equation brings in variables $y_2^{**}$ and $y_3^{**}$ (double-starred from the point of view of the first equation), and intermixture of the third equation brings in $y_3^{**}$, which is absent from the first equation.

In (6-1) the second equation is underidentified, because the third equation can be merged into it without bringing in any variable that is not already in the pure, uncontaminated second equation.

Finally, the third equation is identified, because an intermixture of the first equation introduces $z_3^{**}$ and $z_4^{**}$, and intermixture of the second equation introduces $y_2^{**}$ and $z_3^{**}$ (the double stars are now from the point of view of the third equation).

This example shows that underidentification can be detected by checking whether given strategic parameters in the model are specified a priori to be zero or nonzero. This justifies the following statement:

Counting Rule 1. For an equation to be exactly identified it is necessary (but not sufficient) that the number of variables absent from it be one less than the number of sectors.
Thus, if $G^*$ and $H^*$ are, respectively, the number of endogenous and exogenous variables present in the $g$th equation (and $G$ and $H$ the corresponding totals for the whole model), then for the $g$th equation to be identified it is necessary (but not sufficient) that $G + H - (G^* + H^*) = G - 1$, or that $H - H^* = G^* - 1$.

Parameter equalities in the same equation

Another quite common a priori restriction is to set two or more parameters of a given equation a priori equal. For instance, let us interpret (6-1) as a model of bank behavior, where $z_1$ represents balances of banks at the Federal Reserve and $z_2$ represents balances of banks at other banks. It is conceivable that a commercial bank may conduct its loan policy by looking at its total balances and not at whether they are held at the Federal Reserve or at another bank. The restriction would be expressed $\gamma_{21} = \gamma_{22}$. On the other hand, some other sector, say, foreign banks, may treat the two kinds of balances differently, $\gamma_{32} \neq \gamma_{31}$. Under these conditions, if the third equation is intermixed with the second equation, the result cannot masquerade as the second equation, because the bogus second equation would have different coefficients for $z_1$ and $z_2$, contrary to the a priori assumption that the response to all balances (Federal Reserve and other) is identical.

Linear equations connecting the parameters of different equations

Suppose that a model contains a production function and an equation showing the distribution of national income by factor shares. Then the coefficient of the share of labor is a priori equal to the labor coefficient of the production function, on the grounds of the marginal productivity theory of wages.

Collectively, all the linear restrictions on $A$ discussed so far can be capsuled into Counting Rule 2. Let $A^{**}$ be what is left of $A$ if we throw out the columns corresponding to the variables present in the $g$th equation.

Counting Rule 2. For the $g$th equation to be exactly identified it is necessary and sufficient that the matrix $A^{**}$ have rank $G - 1$.
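As a rough numerical illustration of Counting Rule 2, the sketch below applies the rank test to the zero pattern of model (6-1). It is only a sketch under stated assumptions: the unrestricted coefficients are filled with arbitrary made-up numbers (the text gives no numerical values), and the names `PATTERN` and `rank_check` are hypothetical helpers, not notation from this book. For each equation it forms $A^{**}$ from the columns of the variables absent from that equation and reports its rank, to be compared with $G - 1 = 2$.

```python
import numpy as np

# Columns of A = [B, Gamma]: y1, y2, y3, z1, z2, z3, z4.
# 1 marks a coefficient left free in (6-1); 0 marks an a priori zero.
PATTERN = np.array([
    [1, 0, 0, 1, 1, 1, 1],   # first equation of (6-1)
    [1, 1, 1, 1, 1, 1, 0],   # second equation
    [1, 0, 1, 1, 1, 0, 0],   # third equation
], dtype=bool)


def rank_check(pattern, seed=0):
    """Return (number of absent variables, rank of A**) for each equation."""
    rng = np.random.default_rng(seed)
    # Assumption: arbitrary nonzero values stand in for the unknown parameters.
    A = np.where(pattern, rng.uniform(0.5, 2.0, size=pattern.shape), 0.0)
    out = []
    for g in range(pattern.shape[0]):
        absent = ~pattern[g]          # variables missing from equation g
        A_double_star = A[:, absent]  # keep only the columns of absent variables
        out.append((int(absent.sum()), int(np.linalg.matrix_rank(A_double_star))))
    return out


results = rank_check(PATTERN)
G = PATTERN.shape[0]
for g, (n_absent, rank) in enumerate(results, start=1):
    print(f"equation {g}: absent variables = {n_absent}, "
          f"rank A** = {rank}, G - 1 = {G - 1}")
```

With generic coefficients the first and third equations reach rank 2 while the second reaches only rank 1, in line with the discussion of (6-1) above; comparing the count of absent variables with $G - 1$ recovers Counting Rule 1 as well.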
These tests and counting rules can (and should) be applied before you start computations.

There are no convenient counting rules for nonlinear restrictions on the parameters.

Inequalities such as $\alpha > 0$ or $\alpha < 0$ do not help to remove underidentification. For instance, knowledge that demand is downward-sloping and that supply is upward-sloping does not help to identify the model (6-5).

6.8. Constraints on parameter estimates

Consider again the supply-demand model (6-7). It states a priori that rainfall influences supply and not demand; and this restriction identifies the demand equation (but not the supply). Now imagine that you draw an unlucky sample made up of cases where the other random elements $u_1$ have annihilated the theoretical effect of rainfall. You will get $\hat\gamma_{11}$ (for this sample) equal to zero. The sample has behaved as if rainfall did not influence supply, i.e., as if the model were reduced to (6-5), where the demand was statistically indistinguishable from the supply.

The moral of this is: If you are not a priori certain that supply is influenced by rainfall (not only theoretically but also in the sample period) then do not proceed with the estimation of demand. If you fear that rainfall fails to affect supply (whether in the sample or generally), then to estimate the demand introduce in the supply function another variable $z_2$ (say, last year's price of a competing crop) that you are certain influences (if ever so little) this year's supply both theoretically and in the sample period. The new model then is

$$\begin{aligned}
SS\quad & y_1 + \beta_{12}y_2 + \gamma_{11}z_1 + \gamma_{12}z_2 = u_1 \\
DD\quad & \beta_{21}y_1 + y_2 = u_2
\end{aligned} \tag{6-8}$$

and so last year's price takes on the burden that rainfall is supposed to carry in making the demand identifiable.

A very neat extension of Counting Rule 2 covers all these requirements: For exact identification, the ranks of $A^{**}$ and $\hat A^{**}$ must equal $G - 1$.

We can show this in a third way.¹ If, in the original model (6-7),

¹ With acknowledgments to T. C.
Koopmans, in Hood, pp. 31-32. 6.9. CONSTRAINTS ON THE STOCHASTIC ASSUMPTIONS 95 Yn is truly nonzero, then it is impossible to construct a bogus demand equation without detecting it. Take as the bogus demand DD® = %DD + %SS Then the bogus random term of demand is © 2ui + u 2 7n . Then cov (uf,z\) is not zero, and will show up in estimation, unless the sample is the unlucky one in which rainfall is neutralized (f n ~ 0) by the random factor. If, upon completing the estimation, we discover that mj,.,, is quite different from zero, then we can detect underidentification but we cannot remove it. On the other hand, the discovery that w;,.,, is nearly zero is no guarantee that we have identified demand if there is a strong reason to suspect that supply is unaffected by rainfall. 6.9. Constraints on the stochastic assumptions Let the random terms of (6-5) satisfy Simplifying Assumptions 1 to 7, so that cov (ui f u 2 ) = 0. Will this help to identify the supply SS? Sometimes. Suppose that we knew beforehand that var u\ 9 cov (ui,u 2 ), var u 2 were of the orders of magnitude 3, 0, 10, respec- tively. The "deception" can be detected from (6-6) if 2(jB? ) 2 , which is the estimate of var uf, is very different from 3. This can have happened by chance in the sample used, but it becomes more and more unlikely the more 2(j2?) 2 differs from 3. On the other hand, the bogus variances and covariances may have nothing peculiar about them — indeed they may equal 3, 0, and 10, respectively, because of a special set of values that j, k, m, and n have taken on, for example, J — %> -h = }4> m ■■ — M> n — K- Therefore, in general, there is no guarantee that SS® will look statistically different from SS, even if we have complete knowledge of the underlying covariances of the random term. Another way to impose identification on a model is to say something specific about the variances of the random terms. 
This was done by 96 IDENTIFICATION Schultz in some early studies of agricultural markets. 1 In some of Schultz's work, both supply and demand are functions of the same two endogenous variables (price and quantity) and of random shocks. However, supply is more random than demand. Then the scatter of observed points will be more in agreement with the demand than with the supply function. Ambiguity is not eliminated entirely, but it is reduced as the randomness of supply increases relative to the random- ness of demand. In the notation of (G-6), the restriction takes the form var U\ ■ q var w 2 ; and identification improves with increase in q. More complex restriction of this kind could also help. To summarize the results of Sees. 6.7 to G.9: 1. Identification can be checked before computing by use of the Counting Rules as applied to A. 2. If you fear that art equation is underidentified because you are not sure whether a given variable x reacts significantly, estimate the equa- tion anyhow and then check whether the covariance of x with the residual % is near zero; if not, you may have identified the gth equation. If m x .u a is near zero, you have not identified your equation. If the numerically largest determinant of rank G — 1 from A** is close to zero, X probably did not play a significant role. 3. There are tests that help detect underidentification. 4. It is sometimes possible to remove underidentification. 6.10. Identifiable parameters in an underidentified equation When an equation is underidentified, is it perhaps possible to identify one or more of its parameters, though not all? For instance, what about the identifiability of 7n in (6-7) ? Intuition says that 711 cannot be adulterated by linear combinations of DD, since Z\ occurs only in the supply SS, Intuition is wrong if it concludes that this fact makes 711 identifiable. 
Applying (6-4), we have

$$-\gamma_{11} = \pi_{11} + \beta_{12}\pi_{21}$$

The $\pi$s can be computed from the reduced form

$$y_1 = \pi_{11}z_1 + v_1$$
$$y_2 = \pi_{21}z_1 + v_2$$

but $\beta_{12}$ is and remains unknown and, therefore, so does $\gamma_{11}$.

¹ Henry Schultz, The Theory and Measurement of Demand, pp. 72-81 (University of Chicago Press, Chicago: 1938).

So, contrary to intuition, the fact that a given variable enters one equation of a model and no others does not make its coefficient identifiable. Underidentification is a disease affecting all parameters of the affected equation. For, if the $g$th equation is unidentified, this means that there are fewer equations than unknowns in the $g$th row of formula (6-4). All coefficients of the $g$th equation enter (6-4) symmetrically, and so none can have a privileged position over the others.

Let us now ask whether we can identify, in an otherwise unidentifiable equation, the ratio of two unidentifiable coefficients. In special cases it may be both important and sufficient to know the relative rather than the absolute impact of two kinds of variables. Let us consider (6-8) as a model of the supply and demand for loans, where $y_1$ is quantity of loans, $y_2$ is interest rate, $z_1$ is balances at the Federal Reserve, and $z_2$ is balances at foreign banks. We are curious to know whether the two kinds of bank balances differ in their effects on the loan policy of a commercial bank. Is it possible to identify $\gamma_{11}/\gamma_{12}$? No, because (6-4) applied to this model yields

$$-\gamma_{11} = \pi_{11} + \beta_{12}\pi_{21}$$
$$-\gamma_{12} = \pi_{12} + \beta_{12}\pi_{22}$$

which cannot be solved for $\gamma_{11}/\gamma_{12}$ so long as $\beta_{12}$ is unidentified. The most we can get is the relation

$$\frac{\gamma_{11} + \pi_{11}}{\pi_{21}} = \frac{\gamma_{12} + \pi_{12}}{\pi_{22}}$$

which is a straight line in the $\gamma_{11}, \gamma_{12}$ space, giving an infinity of pairs $(\gamma_{11}, \gamma_{12})$.

Exercises

6.B Derive explicitly the equations $-\Gamma = B\Pi$ for (6-8).

6.C In the above exercise, compute the two values of $\beta_{21}$ in terms of the coefficients of the reduced form.
Under what arithmetical conditions would they be identical? Interpret this in economic terms.

6.11. Source of ambiguity in overidentified models

Let us return to (6-8), rewriting it for convenience

$$\begin{aligned}
SS\quad & q + \beta_{12}p + \gamma_{11}r + \gamma_{12}c = u_1 \\
DD\quad & \beta_{21}q + p = u_2
\end{aligned} \tag{6-9}$$

where $q$ = quantity, $p$ = price, $r$ = rainfall, $c$ = last year's price of a competing crop. Supply is underidentified, and demand is overidentified. For the latter we get from the reduced form two incompatible estimates of the single unknown $\beta_{21}$:

$$\hat\beta'_{21} = -\frac{\hat\pi_{21}}{\hat\pi_{11}} \qquad \hat\beta''_{21} = -\frac{\hat\pi_{22}}{\hat\pi_{12}}$$

But why should the reduced form, if estimated by least squares, give two values for $\beta_{21}$, the price elasticity of demand? The answer is in terms of the wobblings of the supply function. In (6-9), supply wobbles in response to random shocks $u_1$ and to two unrelated exogenous variables, this year's rainfall $r$ and last year's price $c$ of a competing crop. In Fig. 10a I have drawn some supply curves corresponding to different amounts of rainfall ($+1$, $-1$) for a fixed value of $c$ ($= 0$). Observable points fall in the parallelogram ABCD. On the other hand, in Fig. 10b the variations in supply come not from rainfall (which is held constant at 0) but from last year's price only. Observations fall in the parallelogram EFGH.

The first estimate

$$-\hat\beta'_{21} = \frac{\hat\pi_{21}}{\hat\pi_{11}} = \frac{m_{(p,c)\cdot(r,c)}}{m_{(q,c)\cdot(r,c)}} \tag{6-10}$$

corresponds to the broken line in Fig. 10a, because $\hat\pi_{21}/\hat\pi_{11}$ correlates price and quantity reactions as they result from variations in rainfall only. The other estimate

$$-\hat\beta''_{21} = \frac{\hat\pi_{22}}{\hat\pi_{12}} = \frac{m_{(r,p)\cdot(r,c)}}{m_{(r,q)\cdot(r,c)}} \tag{6-11}$$

corresponds to the broken line in Fig. 10b, because it correlates $p$ to $q$ as a result of variations in last year's price alone.¹ The sample must be very peculiar indeed that yields equal estimates $\hat\beta'_{21}$ and $\hat\beta''_{21}$.
¹ In expressions like (6-10) and (6-11), the heuristic device of canceling the "factors" $c$ and $r$ in numerator and denominator gives a correct interpretation of what is being correlated, provided that these "factors" appear on both sides of both dots.

The explanation, then, is at bottom simple: When demand is overidentified, this means that both rainfall $r$ and lagged price $c$ make the supply shift up and down, and trace the demand relationship for us. The original form of the model shows this. The reduced form, however, does not allow us to trace the demand uniquely, as the result of the combined effect of rainfall and lagged price. Rather, the reduced form gives us a choice of estimating the slope of the demand equation either as a result of rainfall-induced variations in supply or as a result of lagged-price-induced variations in supply. Essentially, then, either alternative leaves out some crucial consideration, namely, the fact that the omitted variable (lagged price and rainfall, respectively) also affects the price and quantity combinations that the sample shows.

Fig. 10. Ambiguity in an overidentified equation. Panels (a) and (b).

To show that (6-10) is a biased estimate, write $p = u_2 - \beta_{21}q$. Then $m_{(p,c)\cdot(r,c)} = m_{(u_2,c)\cdot(r,c)} - \beta_{21}m_{(q,c)\cdot(r,c)}$, and so

$$-\hat\beta'_{21} = -\beta_{21} + \frac{m_{(u_2,c)\cdot(r,c)}}{m_{(q,c)\cdot(r,c)}}$$

The expected value of the bias term is not zero. This is easily seen from (6-9). Let $r$, $c$, and $u_1$ be fixed, and let $u_2$ take on a set of conjugate values $+u_2$ and $-u_2 \neq 0$. Then, in (6-9), $q$ necessarily takes on two different values $q'$ and $q''$, and thus the above denominator changes as $u_2$ takes on its conjugate values. Therefore, $m_{(+u_2,c)\cdot(r,c)}/m_{(q',c)\cdot(r,c)}$ and $m_{(-u_2,c)\cdot(r,c)}/m_{(q'',c)\cdot(r,c)}$ do not add up to zero. To show that $\hat\beta'_{21}$ is a consistent estimate, consider that $\operatorname{plim} m_{(u_2,c)\cdot(r,c)} = 0$ but $\operatorname{plim} m_{(q,c)\cdot(r,c)} \neq 0$.
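The two conflicting estimates of $\beta_{21}$ can be seen in a small simulation of model (6-9). What follows is a minimal sketch, not a computation from the text: the parameter values, the standard normal distributions for $r$, $c$, $u_1$, $u_2$, and the function name `two_ils_estimates` are all assumptions chosen for illustration. The code generates data from (6-9), fits the reduced form by least squares, and works back to $-\hat\pi_{21}/\hat\pi_{11}$ and $-\hat\pi_{22}/\hat\pi_{12}$.

```python
import numpy as np


def two_ils_estimates(n, seed=0, beta12=-0.5, beta21=0.8,
                      gamma11=-1.0, gamma12=-1.5):
    """Simulate (6-9) and return the two reduced-form estimates of beta21.

    All numerical values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    B = np.array([[1.0, beta12],
                  [beta21, 1.0]])            # rows: SS, DD; columns: q, p
    Gamma = np.array([[gamma11, gamma12],
                      [0.0, 0.0]])           # r and c are excluded from DD
    z = rng.normal(size=(n, 2))              # columns: r, c
    u = rng.normal(size=(n, 2))              # columns: u1, u2
    # Structural form B y + Gamma z = u, solved for y = (q, p) per observation.
    y = np.linalg.solve(B, (u - z @ Gamma.T).T).T
    # Least squares on the reduced form y = Pi z + v (no intercept needed).
    coef, *_ = np.linalg.lstsq(z, y, rcond=None)
    Pi_hat = coef.T                          # Pi_hat[i, k]: equation i, variable k
    via_rainfall = -Pi_hat[1, 0] / Pi_hat[0, 0]      # -pi21 / pi11
    via_lagged_price = -Pi_hat[1, 1] / Pi_hat[0, 1]  # -pi22 / pi12
    return via_rainfall, via_lagged_price


small = two_ils_estimates(50)
large = two_ils_estimates(100_000)
print("n = 50:     ", small)
print("n = 100000: ", large)
```

In a short sample the two numbers disagree, which is the ambiguity described above; as the sample grows both drift toward the value of $\beta_{21}$ used to generate the data, consistent with the claim that each estimate is biased but consistent.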
Exercises

6.D If it turns out that $\hat\beta'_{21} = \hat\beta''_{21}$, the sample moments must satisfy either the equation $m_{rr}m_{cc} = m_{rc}m_{rc}$ or the equation $m_{pc}m_{qr} = m_{pr}m_{qc}$. The first of these declares that rainfall and last year's price are perfectly correlated in the sample. Interpret the second one. Hint: Use the fact that $p = u_2 - \beta_{21}q$.

6.E If least squares are applied to the reduced form, obtaining $\hat\pi$s, prove the following: (1) that all parameters $\beta$, $\gamma$ that can be estimated by working back from the reduced form are consistent; (2) that all $\gamma$s that can be estimated (whether uniquely or ambiguously) are in general biased; (3) that all $\beta$s that can be estimated (whether uniquely or ambiguously) are in general biased.

In the following exercises, $p$ (price) and $q$ (quantity) are the endogenous variables. The exogenous variables are $i$ (interest rate), $f$ (liquid funds), and $r$ (rainfall).

6.F Show that in

$$\begin{aligned}
SS\quad & q + \beta_{12}p = u && \text{(exactly identified)} \\
DD\quad & \beta_{21}q + p + \gamma_{21}i = v && \text{(underidentified)}
\end{aligned}$$

only $\beta_{12}$ can be estimated unambiguously.

6.G From the model

$$\begin{aligned}
SS\quad & q + \beta_{12}p + \gamma_{13}r = u && \text{(overidentified)} \\
DD\quad & \beta_{21}q + p + \gamma_{21}i + \gamma_{22}f = v && \text{(exactly identified)}
\end{aligned}$$

the reduced form leads to the following estimates of $\beta_{12}$:

$$\hat\beta'_{12} = -\frac{\hat\pi_{11}}{\hat\pi_{21}} \qquad \hat\beta''_{12} = -\frac{\hat\pi_{12}}{\hat\pi_{22}}$$

where $\hat\pi_{11} = m_{(q,f,r)\cdot(i,f,r)}/m_{(i,f,r)\cdot(i,f,r)}$ and $\hat\pi_{21} = m_{(p,f,r)\cdot(i,f,r)}/m_{(i,f,r)\cdot(i,f,r)}$. Show that these estimates are biased and consistent.

6.H In Exercise 6.G, find the bias (if any) for $\hat\gamma_{13}$, $\hat\gamma_{21}$, $\hat\gamma_{22}$.

6.12. Identification and the parameter space

The likelihood function $L(\hat A)$ may or may not have a unique highest maximum as a function of the parameter estimates $\hat A$. If it does, the model is (exactly or over-) identified.

Fig. 11. Maxima of the likelihood function. a. Underidentified; b. exactly identified; c. overidentified.

Along the axes, labeled $\alpha$ and $\beta$ in Fig. 11, let me represent the parameter space. Usually this space has more dimensions, but I cannot picture these on flat paper.
Underidentification is pictured in Fig. 11a. Here the mountain has either a flat top $T$ or a ridge $RR'$, or both. Its elevation is highest in many places rather than at a single place; i.e., there are many local maxima. This means that several values of $\alpha$ and $\beta$ are candidates for the role of estimates of the true $\alpha$ and $\beta$. In the picture these candidates lie in the cobra-like area $PP'Q$ that creeps on the floor.

When the system is (exactly or over-) identified, nothing of this sort happens. The mountain has a single highest point. If the system is exactly identified, this fact is the end of the story, and Fig. 11b applies.

When the system is overidentified, then Fig. 11c applies. The mountain in Fig. 11c is the same as in Fig. 11b, but we have several conflicting ways to look for the top. One estimating procedure allows us to look for the highest point of the mountain along, say, the 38th parallel; another equally admissible procedure tells us to look for it not along the 38th parallel but along the boundary $XY$ between area I and area II. Accordingly, we get $P'$ and $P''$, two estimates of $P$ that correspond to $\hat\beta'_{21}$ and $\hat\beta''_{21}$ of equations (6-10) and (6-11).

6.13. Over- and underidentification contrasted

The example in the figure suggests that overidentification and underidentification are not simple logical opposites, except in a very trivial sense, in relation to the Counting Rules. Table 6.1 gives the contrasts among over-, exact, and underidentification.

We say that underidentification is not usually a stochastic property, because it arises from the a priori specification of the model and not from sampling, and so it cannot be removed by better sampling. Stochastic underidentification is in the nature of a freak; it was illustrated in Sec. 6.8.

On the other hand, overidentification is a stochastic property that arises because we disregard some information contained in the sample.
Overidentification is removed if all the information of the sample is utilized, which means that reduced-form least-squares estimation must be abandoned.

Table 6.1 Degree of identification

| Underidentification | Exact identification | Overidentification
Unique maximum of the likelihood function | Does not exist | Exists | Exists
A priori restrictions for locating single highest point | Not enough | Enough | Too many
Ambiguity, if any, introduced because: | You have not enough independent variation in supply and demand | No ambiguity | In reduced form you disregard one or another cause in the variation of supply
Estimate of the parameters if based on reduced form: $\beta$s | Biased, consistent | Biased, consistent | Biased, consistent
Estimate of the parameters if based on reduced form: $\gamma$s | Biased,* consistent | Biased,* consistent | Biased,* consistent
Is the degree of identification a stochastic property? | Not usually; yes, if in fact a variable fails to vary | | Yes

* In special cases, unbiased.

6.14. Confluence

Multicollinearity and underidentification are two special cases of a mathematical property called confluence. Multicollinearity arises when you cannot separate the effects of two (or more) theoretically independent variables because in your sample they happen to have moved together. This topic is taken up again in Chap. 9.

To show the connection between underidentification and multicollinearity, I shall use a model adapted from Tintner, p. 33, which contains both.

Suppose that the world price of cotton is established by supply and demand conditions in the United States. Let the supply of American cotton $q$ depend only on its world price $p$, while the American demand for cotton depends both on its price and on national income $y$.

$$\begin{aligned}
DD\quad & q = \alpha p + \beta y + u \\
SS\quad & q = \gamma p + v
\end{aligned} \tag{6-12}$$

Fig. 12. Confluence.

Now, demand in this model is underidentified. If the sample comes
from earlier years, when cotton was king, then the model, in addition, suffers from multicollinearity, because the national income was strongly correlated with the price and quantity of cotton.

In the parameter space for $\alpha$, $\beta$, and $\gamma$, the likelihood function $L$ has a stationary value over a region of the space $\alpha\beta\gamma$. To picture this (Fig. 12) let us forget $\gamma$, or assume that someone has disclosed it to be $+0.03$. The true value of the parameters is the point $(\alpha, \beta, 0.03)$. The ambiguous area, over which $L$ has a flat top, is the band PQRS in Fig. 12. If a sample is taken from more recent years, multicollinearity is reduced, because national income $y$ and the world price of cotton $p$ are no longer so strongly correlated as before. In Fig. 12, the gradual diversification of America's economy would appear as a gradual migration of the points in the band PQRS toward a narrower band around the curve MN. If the time comes when cotton becomes quite insignificant, then multicollinearity will have disappeared, but not the underidentification. In the figure, the band will have collapsed to the curve MN, but not to a single point.

Exercise

6.1 Suggest methods for removing multicollinearity and discuss them.

Digression on the etymology of the term "multicollinearity"

Suppose, for purposes of illustration only, that in (6-12) the true values of the parameters are simply $\alpha = -1$, $\beta = 1$, $\gamma = 1$. Also suppose that national income and the price of cotton are connected by the exact relation

$$y = 3p$$

(The exactness is for illustrative purposes only. What follows is also true for the stochastic case $y = 3p + w$.) Then the demand can be written

$$\text{DD:}\quad q = 2p + u$$

Now, the following estimates of $\alpha$ and $\beta$ are consistent with all observations:

1. The true values $\alpha = -1$, $\beta = 1$
2. The pair of values $\alpha = 1$, $\beta = 1/3$, because $q = 1p + \tfrac{1}{3}y + u = 2p + u$
3. The pair of values $\alpha = 2$, $\beta = 0$; and an infinity of other pairs, which can be represented as the collinear points of line AB in Fig.
13.

If we take a bogus demand function, then its parameters $\alpha^{\oplus}$, $\beta^{\oplus}$ also form a collinear set of points (like line CD) that agrees with the sample. Removing multicollinearity causes EF to collapse into M, AB into N, and CD into P; that is to say, removing multicollinearity collapses the band between the lines EF and CD into the line MNP. On the other hand, removing underidentification collapses the same band into the line AB.

[Fig. 13. Multicollinearity.]

Further readings

Koopmans, chap. 17, extends and refines the concept of a complete model. The excellent introduction to identification, also by Koopmans, was cited in Sec. 6.1. Klein gives scattered examples with discussion (consult his index). It is worthwhile to read the seminal article of Elmer J. Working, "What Do Statistical 'Demand Curves' Show?" (Quarterly Journal of Economics, vol. 41, no. 2, pp. 212-235, February, 1927), both for its contents and in order to appreciate how far econometrics has progressed since that time. Trygve Haavelmo's treatment of confluence, in "The Probability Approach in Econometrics" (Econometrica, vol. 12, Supplement, 1944), is hard going, but an excellent exercise in decoding. If you have come this far, you can tackle this piece. Those who appreciate the refinement and proliferation of concepts and are not afraid of flights into abstraction may glance at Leonid Hurwicz, "Generalization of the Concept of Identification," chap. 4 of Koopmans. Herbert Simon, "Causal Order and Identifiability," in Hood, chap. 3, shows how an econometric model can be analyzed into a hierarchy of submodels increasingly more endogenous and how the hierarchy accords with the statistical notion of causation and that of identifiability.

CHAPTER 7
Instrumental variables

7.1. Terminology and results

The term instrumental variable in econometrics has two entirely unrelated meanings:

1.
A variable that can be manipulated at will by a policy maker as a tool or instrument of policy; for instance, taxes, the quantity of money, the rediscount rate
2. A variable, exogenous to the economy, significant, not entering the particular equation or equations we want to estimate, nevertheless used by us in a special way in estimating these equations

In this work only the second meaning is used. This chapter explains and rationalizes the instrumental variable technique. It shows:

1. That the technique, though at first sight it appears odd, is logically similar to other estimating methods and also quite reasonable
2. That, if the choice of instrumental variables is unique, the model is exactly identified, and that the instrumental variable method is equivalent to applying least squares to the reduced form and then solving back to the original form

To understand the logic of instrumental variables we must first take a deep look at parameter estimation in general.

7.2. The rationale of estimating parametric relationships

Two ideas dominate the strategy of inferring parametric connections statistically. The first idea is that variables can be divided into causes and effects. The second is that conflicting observations must be weighted somehow.

Causes and effects

There can be one or more causes, symbolized by $c$, and one or more effects, symbolized by $e$. Various "instances" or "degrees" of $c$ and $e$ will carry a subscript. A parameter is nothing more than the change in effect(s), given a change in cause(s). Symbolically this can be represented as follows:

$$\text{Parameter} = \frac{\text{change in effect(s)}}{\text{corresponding change in cause(s)}} = \frac{\Delta e}{\Delta c} \tag{7-1}$$

This relation¹ is fundamental in Chaps. 7 and 8; the general theme of these chapters is that all techniques of estimation are variations and elaborations of (7-1). The change in cause(s) and effect(s) can be any change whatsoever.
For instance, the change in effect $e$ may be $e_1 - e_2$, or $e_2 - e_1$, or $e_{15} - e_{25}$; in general, $e_t - e_s$. The corresponding changes in cause $c$ are $c_1 - c_2$, and $c_2 - c_1$, and $c_{15} - c_{25}$; in general, $c_t - c_s$. Usually, however, the change is computed from some fixed reference level of the effect(s) or the cause(s), and this fixed level is most typically the mean.² So parameters are usually computed by a formula like (7-2) rather than (7-1).

$$\text{Parameter estimate} = \frac{e_t - \bar{e}}{c_t - \bar{c}} = \hat{\pi} \tag{7-2}$$

This is merely a convenience, which does not affect the logic of parameter estimation. Henceforth, all symbols $\Delta e$, $\Delta c$, $e_t$, $c_t$, or simply $e$ and $c$ represent deviations from the corresponding mean.

¹ It is meant to be a conventional relationship. Only in the simplest linear systems is it true that the numerical value of a parameter can be expressed as simply as in equation (7-1).
² Ideally, the mean of the population. In practice, the mean of the sample. In linear models the distinction is immaterial for most purposes.

The problem of conflicting observations

What happens when two or more applications of (7-2) give different values for $\pi$? This is very likely to happen in stochastic models, because in such models the effect $e$ results not merely from the explicit cause $c$ but also from the disturbance $u$. Which of the several conflicting values should be assigned to the unknown parameter $\pi$? The problem arises in all but the simplest cases of parameter estimation.

In general, the parameter estimate is a weighted average of quotients of the form $e/c$. Take the model $e_t = \gamma c_t + u_t$. Then the weighted estimate of $\gamma$ is

$$\hat{\gamma} = \frac{e_1}{c_1} w_1 + \frac{e_2}{c_2} w_2 + \cdots + \frac{e_S}{c_S} w_S$$

Any set of weights $w_1, w_2, \ldots, w_S$ will do, provided they add up to 1. If you want to attach much significance to the instance $e_{17}/c_{17}$, make $w_{17}$ large; if you want to disregard the instance $e_{27}/c_{27}$, make $w_{27} = 0$ or even negative.
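A small numerical sketch may make the weighted-average idea concrete. It is not from the text: the three observations below are invented, they are expressed as deviations from their means, and the function name is an assumption chosen for illustration. It shows only that any weights adding up to 1, a negative one included, turn the conflicting quotients e/c into a single estimate.

```python
def weighted_ratio_estimate(effects, causes, weights):
    """Weighted average of the quotients e/c; the weights must add up to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must add up to 1")
    return sum(w * e / c for e, c, w in zip(effects, causes, weights))


# Hypothetical observations (not from the book), already in deviation form.
causes = [-3.0, 1.0, 2.0]
effects = [-6.3, 2.4, 3.9]  # the three quotients e/c are 2.1, 2.4, and 1.95

equal_weights = [1 / 3, 1 / 3, 1 / 3]
favor_first = [0.8, 0.1, 0.1]      # attach much significance to the first instance
negative_third = [0.7, 0.6, -0.3]  # a negative weight is admissible too

estimate_equal = weighted_ratio_estimate(effects, causes, equal_weights)
estimate_favor_first = weighted_ratio_estimate(effects, causes, favor_first)
estimate_negative_third = weighted_ratio_estimate(effects, causes, negative_third)
```

Each choice of weights gives a different reconciliation of the same three quotients, which is why a rule for choosing the weights is needed.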
Let us for simplicity restrict ourselves to just two observations. One of the many possible sets of weights is the following:

$$w_1 = \frac{c_1^2}{c_1^2 + c_2^2} \qquad w_2 = \frac{c_2^2}{c_1^2 + c_2^2} \tag{7-3}$$

Then, using these weights, the weighted estimate is also the familiar least squares estimate, because

$$\hat{\gamma} = \frac{e_1}{c_1}\,\frac{c_1^2}{c_1^2 + c_2^2} + \frac{e_2}{c_2}\,\frac{c_2^2}{c_1^2 + c_2^2} = \frac{e_1 c_1 + e_2 c_2}{c_1^2 + c_2^2} = \frac{m_{ec}}{m_{cc}}$$

So the least squares estimate amounts to nothing more than a special method for weighting conflicting values of the ratios $e/c$.

Now two questions arise: (1) Why should the weights (7-3) be functions of the $c$'s or have anything at all to do with the cause $c$? and (2) why should we pick those particular formulas and not, for example, absolute values

$$w_1 = \frac{|c_1|}{|c_1| + |c_2|} \qquad w_2 = \frac{|c_2|}{|c_1| + |c_2|} \tag{7-4}$$

or cubes, square roots, or logarithms?

The answer to question 1 is that the more strongly the cause departs from its average level, the more you weight it. It is as though we said that the real test of the relationship $e_t = \gamma c_t + u_t$ is whether it stands up under atypical, nonaverage conditions (i.e., when $c_t$ is far from $\bar{c}$). But now why should one [as formulas (7-3) and (7-4) suggest] give equal weight to $+c_t$ and $-c_t$? The common-sense rationale here is that the same credence should be given to a ratio $e/c$ when $c$ is atypically small (relative to its mean) as when it is atypically large. This requirement is met by an evenly increasing function of $c$, for instance:

$$\frac{|c|}{\sum |c|} \qquad \frac{c^2}{\sum c^2} \qquad \frac{|c^3|}{\sum |c^3|} \qquad \frac{c^4}{\sum c^4} \qquad \frac{\sqrt{|c|}}{\sum \sqrt{|c|}} \qquad \frac{\log c^2}{\sum \log c^2} \tag{7-5}$$

The answer to question 2 is this: From the many alternative formulas in (7-5), $c^2/\sum c^2$ is selected because of the assumption that $u$ is normally distributed, in which case least squares approaches maximum likelihood. A different probability distribution of the disturbances would prescribe a different set of weights for averaging or reconciling conflicting values of $e/c$.

7.3.
A single instrumental variable

With these preliminaries well digested, we are ready to understand the logic of instrumental variables. Suppose that

$$p_t = \beta q_t + u_t \tag{7-6}$$

is a model for the demand for sugar. This equation is known to have come from a larger system in which $p$ (price) and $q$ (quantity) are endogenous and a great many other causes are active. We call (7-6) the manifest model and all the remaining relationships the latent model. Suppose that $z$ represents some exogenous cause affecting the economy, say, the tax on steel. Now in an interdependent economic system the tax on steel affects theoretically both the price of sugar and the quantity of sugar bought, because a tax cannot leave unaffected either the price or the quantity of any substitute or competitive goods, and sugar surely must be somewhere at the end of the chain of substitutes or complements of steel.

The method of instrumental variables says that you ought to compute an estimate of $\beta$ (from two observations $s = 1, 2$) as follows:

$$\hat{\beta} = \frac{p_1}{q_1} w_1 + \frac{p_2}{q_2} w_2 \tag{7-7}$$

using as weights

$$w_1 = \frac{q_1 z_1}{q_1 z_1 + q_2 z_2} \qquad w_2 = \frac{q_2 z_2}{q_1 z_1 + q_2 z_2} \tag{7-8}$$

so that the estimate of $\beta$ is

$$\hat{\beta} = \frac{p_1}{q_1}\,\frac{q_1 z_1}{q_1 z_1 + q_2 z_2} + \frac{p_2}{q_2}\,\frac{q_2 z_2}{q_1 z_1 + q_2 z_2} = \frac{p_1 z_1 + p_2 z_2}{q_1 z_1 + q_2 z_2} = \frac{m_{zp}}{m_{zq}} \tag{7-9}$$

Every ounce of common sense in you ought to rear itself in rebellion at this perpetration. You ought to protest, saying: "Nonsense! My boss at the Zachary Sugar Refinery will fire me unceremoniously from my well-paid and respected job of Company Econometrician if I tell the Vice-President that I multiply the price and quantity of sugar with the tax on steel to estimate demand for sugar! Better give me a good argument for this act of alchemy. Moreover, did it ever occur to you that $m_{zq}$ in the denominator could conceivably be equal to zero, and certainly is in some samples?¹ This predicament could not arise in least squares weights like (7-3)."

I hasten to reply to the last point first.
The possibility that $m_{zq}$ might be zero is the reason why $z$ should not be chosen haphazardly but rather from exogenous variables that have a lot to do with the quantity of sugar consumed. So, instead of the tax on steel, perhaps we ought to take the tax on coffee, honey, or sugar, or the quantity of school lunches financed by Congress. Still, it is possible that, in the sample chosen, the quantity of sugar $q$ and the quantity of school lunches $z$ happen to be uncorrelated, and it is true that this sort of difficulty is unheard of in least squares, because $m_{qq}$ just cannot be zero: the quantity of sugar is perfectly correlated with itself.

¹ This sample would be a set of measure zero, as the mathematicians say.

Now what do the weights (7-8) say? They say that, the more sugar consumption and school lunches move together,¹ the more weight should be given to $\Delta p/\Delta q$, the price effect of the change in the quantity consumed.

There is another way to look at this matter: Write, purely heuristically,

$$\beta = \frac{\Delta p}{\Delta q} = \frac{\Delta p/\Delta z}{\Delta q/\Delta z} \tag{7-10}$$

where $\Delta p/\Delta z$ is symbolized by $\gamma_1$ and $\Delta q/\Delta z$ by $\gamma_2$. How could we possibly interpret $\gamma_1$ and $\gamma_2$?

[Fig. 14. The logic of the instrumental variable.]

Figure 14 represents both the latent and the manifest parts of the model (7-6). The solid arrow represents the manifest mutual causation $p \leftrightarrow q$, which appears in (7-6). The broken arrows represent the latent model, which is not spelled out but which states that $z$ affects $p$ and $q$ through the workings of the whole economic system. So the meaning of $\gamma_1$ and $\gamma_2$ is that they are the coefficients of the latent model.

Since $z$ is exogenous, why not estimate $\gamma_1$ and $\gamma_2$ by least squares? It sounds highly reasonable. There should be no objection to this.

$$\hat{\gamma}_1 = \frac{m_{zp}}{m_{zz}} \qquad \hat{\gamma}_2 = \frac{m_{zq}}{m_{zz}} \tag{7-11}$$

Insert (7-11) in (7-10), and then

$$\hat{\beta} = \frac{\Delta p/\Delta z}{\Delta q/\Delta z} = \frac{\hat{\gamma}_1}{\hat{\gamma}_2} = \frac{m_{zp}/m_{zz}}{m_{zq}/m_{zz}} = \frac{m_{zp}}{m_{zq}} \tag{7-12}$$

which justifies the instrumental variable formula (7-9).

¹ In a given instance, not over-all.
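The three readings of the instrumental variable estimate given above (a ratio of moments, a weighted average of the quotients p/q, and a ratio of two least squares slopes on z) can be checked against one another numerically. The sketch below is an illustration only: the five observations are invented and not meant to be realistic, and every function name is an assumption introduced here.

```python
def deviations(series):
    """Express a series as deviations from its sample mean."""
    mean = sum(series) / len(series)
    return [x - mean for x in series]


def moment(a, b):
    """Sum of cross products of two series already in deviation form."""
    return sum(x * y for x, y in zip(a, b))


def iv_ratio_of_moments(p, q, z):
    """The estimate m_zp / m_zq."""
    return moment(z, p) / moment(z, q)


def iv_weighted_quotients(p, q, z):
    """Weighted average of the quotients p/q, with weights q*z / m_zq."""
    m_zq = moment(z, q)
    return sum((p_s / q_s) * (q_s * z_s / m_zq) for p_s, q_s, z_s in zip(p, q, z))


def iv_ratio_of_slopes(p, q, z):
    """Least squares slope of p on z divided by least squares slope of q on z."""
    slope_p_on_z = moment(z, p) / moment(z, z)
    slope_q_on_z = moment(z, q) / moment(z, z)
    return slope_p_on_z / slope_q_on_z


# Hypothetical sample (not from the book): price, quantity, instrument.
p = deviations([4.0, 5.2, 4.1, 6.9, 6.3])
q = deviations([10.0, 12.5, 11.0, 16.0, 15.5])
z = deviations([1.0, 2.0, 3.0, 4.0, 5.0])

estimate_moments = iv_ratio_of_moments(p, q, z)
estimate_weighted = iv_weighted_quotients(p, q, z)
estimate_slopes = iv_ratio_of_slopes(p, q, z)
iv_weights = [q_s * z_s / moment(z, q) for q_s, z_s in zip(q, z)]
```

The weighted-quotient form requires every q in the sample to differ from its mean, since it divides by q; the moment form has no such restriction and fails only when the moment of z with q is zero, which is the predicament raised in the protest above.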
It is well to ask now: in what way do the two examples $p = \beta q + u$ and $e = \gamma c + u$ differ? Why may the second model be computed by least squares although the first has to be estimated by instrumental variables? The reason is that the second, $c \to e$, is a complete model in itself: causation flows from $c$ to $e$, and nothing else is involved. On the other hand, $p = \beta q + u$ hides a lot of complexity, i.e., causation from $z$ to $p$ and from $z$ to $q$. These causations are unidirectional, $z \to p$ and $z \to q$, and can be treated like $c \to e$; but $p \leftrightarrow q$ is not of the same kind, and must be treated by taking into account the hidden part.

7.4. Connection with the reduced form

The instrumental variable technique is very intimately connected with the method of applying least squares to the reduced form. Assume for the moment that $z$ is the only exogenous variable affecting the economy and that the complete model is

$$\begin{aligned} p - \beta q &= u \\ \frac{1}{\gamma_1}\,p + \frac{1}{\gamma_2}\,q - z &= v \end{aligned} \tag{7-13}$$

whose first equation is manifest (the solid arrows in Fig. 14). The second equation is latent and corresponds to the broken arrows.

Exercise

7.A Explain why the complete model cannot contain three independent equations, one for $p \leftrightarrow q$, one for $z \to p$, and one for $z \to q$.

The reduced form is

$$\begin{aligned} Dp &= \beta z + \frac{1}{\gamma_2}\,u + \beta v \\ Dq &= z - \frac{1}{\gamma_1}\,u + v \end{aligned} \tag{7-14}$$

where $D$ is the determinant $1/\gamma_2 + \beta/\gamma_1$.

If the second equation in (7-13) contained another exogenous variable, say, $z'$, then the first equation would be overidentified. This fact would be reflected in the instrumental variable technique as the following dilemma: Should we use $z$ or $z'$ as the instrumental variable? On this last point I have more to say in Secs. 7.5 to 7.7.

7.5. Properties of the instrumental variable technique in the simplest case

The technique is biased and consistent, where naive least squares is biased and inconsistent.
For the instrumental variable estimate,

$$\hat{\beta} = \frac{m_{zp}}{m_{zq}} = \beta + \frac{m_{zu}}{m_{zq}} = \beta + \frac{1}{D}\,\frac{m_{zu}}{m_{zz} - (1/\gamma_1)\,m_{zu} + m_{zv}} \tag{7-15}$$

and for naive least squares,

$$\tilde{\beta} = \frac{m_{qp}}{m_{qq}} = \beta + \frac{m_{qu}}{m_{qq}} = \beta + \frac{1}{D}\,\frac{m_{zu} - (1/\gamma_1)\,m_{uu} + m_{vu}}{m_{zz} - (2/\gamma_1)\,m_{zu} + 2m_{zv} + (1/\gamma_1^2)\,m_{uu} - (2/\gamma_1)\,m_{uv} + m_{vv}} \tag{7-16}$$

Under the ordinary Simplifying Assumptions, $\sigma_{zu} = \sigma_{zv} = \sigma_{uv} = 0$; so, for large samples, the first expression approaches $\beta$, and the second does not, because the term in $m_{uu}$ in its numerator does not vanish.

Expression (7-15) gives us additional guidance for selecting an instrumental variable. To minimize bias, the following conditions should be fulfilled, either singly or in combination:

1. $m_{zu}$ numerically small
2. $m_{zz}$ numerically large
3. $D$ numerically large

The first condition says that $z$ should be truly exogenous to the sugar market. Appropriations for school lunches are better in this respect than the tax on sugar, because the tax might have been imposed to discourage consumption or to maximize revenue, in which case it would have some connection with the parameters and variables of the sugar market.

The second condition says that, in the sample, the instrumental variable must have varied a lot: if the tax on sugar varied only trivially it had no opportunity to affect $p$ and $q$ significantly enough for us to capture $\beta$ by our estimating procedure. From this point of view the tax on sugar might be less desirable as an instrumental variable than some remote but more volatile entity, say, the budget's appropriations for the U.S. Information Service.

The third condition says that $D = 1/\gamma_2 + \beta/\gamma_1$ should be numerically large; that is, that $\gamma_1$ and $\gamma_2$ should be numerically small relative to $\beta$. This says that, to minimize bias, $p$ and $q$ should react more strongly to each other (in the manifest model) than to the instrumental variable in the latent model. It requires that price and quantity be more sensitive to each other in the sugar market than to such things as the U.S.I.S. budget, the tax on honey, or, for that matter, the tax on sugar itself.
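The contrast between the two estimates in a large sample can be illustrated by simulation. What follows is a sketch under stated assumptions, not a result from the text: the parameter values, the normal distributions, the sample size, and all names are chosen here for illustration. The data are generated from model (7-13) with independent z, u, and v, and both estimates are computed from the same sample.

```python
import random


def moment(a, b):
    """Sum of cross products of two series, taken about their sample means."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))


def simulate_model(n, beta, gamma1, gamma2, sd_z, seed):
    """Draw n observations of (p, q, z) from model (7-13).

    The second equation, (1/gamma1) p + (1/gamma2) q - z = v, is solved
    together with p = beta q + u to give q; then p follows from q and u.
    """
    rng = random.Random(seed)
    d = 1.0 / gamma2 + beta / gamma1
    p, q, z = [], [], []
    for _ in range(n):
        z_t = rng.gauss(0.0, sd_z)
        u_t = rng.gauss(0.0, 1.0)
        v_t = rng.gauss(0.0, 1.0)
        q_t = (z_t - u_t / gamma1 + v_t) / d
        p_t = beta * q_t + u_t
        p.append(p_t)
        q.append(q_t)
        z.append(z_t)
    return p, q, z


# Hypothetical parameter values (assumptions, not taken from the book).
BETA = 0.5
p, q, z = simulate_model(n=20000, beta=BETA, gamma1=2.0, gamma2=1.0, sd_z=2.0, seed=0)

iv_estimate = moment(z, p) / moment(z, q)  # instrumental variable estimate
ls_estimate = moment(q, p) / moment(q, q)  # naive least squares estimate
```

With these assumed values the instrumental variable estimate should land close to the true 0.5, while naive least squares settles near a different value however large the sample, because q is correlated with the disturbance u.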
It is not easy to find an instrumental variable fulfilling all conditions at once. However, if the sample is large, any sort of instrumental variable gives better estimates than least squares.

7.6. Extensions

The instrumental variable technique can be extended in several directions:

1. The single-equation incomplete manifest model may contain several parameters to be estimated. For example, the model

$$p = \beta_1 q + \beta_2 y + u$$

requires two instrumental variables $z_1$ and $z_2$. The estimating formulas are analogous to (7-9):

$$\hat{\beta}_1 = \frac{m_{(z_1,z_2)\cdot(p,y)}}{m_{(z_1,z_2)\cdot(q,y)}} \qquad \hat{\beta}_2 = \frac{m_{(z_1,z_2)\cdot(q,p)}}{m_{(z_1,z_2)\cdot(q,y)}} \tag{7-17}$$

All the criteria of Sec. 7.5 for selecting a good instrumental variable are still valid, plus the following: $z_1$ and $z_2$ must really be different variables, that is, variables not well correlated in the sample; else the denominators approach zero, and the estimates $\hat{\beta}_1$, $\hat{\beta}_2$ blow up.

Exercise

7.B If we wish to estimate the parameters of $p = \beta q + \gamma z + u$, where $z$ is exogenous, is it permissible for $z$ itself to be one of the instrumental variables $z_1$ and $z_2$?

2. The incomplete manifest model may consist of several equations, for instance a pair of equations of which the second is

$$\beta_{21}\,p + q + \gamma_2 z = u_2$$

Each equation can be estimated independently of the other, using formulas analogous to (7-17). Variable $z$ itself and another variable $z_1$ may be used as instrumental variables in both equations, or two variables $z_1$, $z_2$ completely extraneous to the manifest model may be used.

7.7. How to select instrumental variables

In some instances we may have several candidates for the role of instrumental variable. The choice is made anew for each equation of the manifest model, and the rules are:

1. If several instrumental variables are needed, they should be those least correlated with one another.
2. The instrumental variables should affect strongly as many as possible of the variables present in the equation that is being estimated.

Choosing instrumental variables is admittedly arbitrary.
Another statistician with the same data might make a different choice and so get different results for the same model. The technique of weighting instrumental variables eliminates some of this arbitrariness. I illustrate the technique for the single-equation, single-parameter demand model $p = \beta q + u$. Suppose that two exogenous variables are available, $z_1$ (the sugar tax) and $z_2$ (the tax on honey), and that both affect $p$ and $q$. To select $z_1$ or $z_2$ is arbitrary. The new variable $z = w_1 z_1 + w_2 z_2$, a linear combination of the two taxes with arbitrary weights $w_1$ and $w_2$, is less arbitrary because both taxes are taken into account.¹ Results improve considerably if we take $w_1$, $w_2$ proportional to the importance of the two taxes on the sugar market. Naturally, to estimate the parameters of the sugar market, the weight given the sugar tax should be greater than that given to the tax on honey; and vice versa when we want to study the honey market. In general, we ought to rank the instrumental variable candidates $z_1, z_2, z_3, \ldots$ in order of increasing remoteness from the sector being estimated and assign them decreasing weights in a new instrumental variable $z = w_1 z_1 + w_2 z_2 + w_3 z_3 + \cdots$. The more accurate the a priori information by means of which weights are assigned, the more does this technique approximate the results of the full information maximum likelihood method, discussed in Chap. 8.

¹ This treatment with $w_1 = w_2 = 1$ coincides with Theil's method with $k = 1$. Consult Chap. 9 below.

Exercises

Warning: These are difficult!

7.C Prove or disprove the conjecture that weighted instrumental variables are "better" than unweighted. Use the model $p = \beta q + u$, $(1/\gamma_1)\,p + (1/\gamma_2)\,q - \delta_1 z_1 - \delta_2 z_2 = v$, where $\gamma_1$, $\gamma_2$, $\delta_1$, $\delta_2$ measure the sensitivity of $p$ and $q$ to $z_1$ and $z_2$. Define $w_1$ and $w_2$ as $\delta_1/(\delta_1 + \delta_2)$ and $\delta_2/(\delta_1 + \delta_2)$.
Define $\hat{\beta}(z) = m_{zp}/m_{zq}$, with $z$ standing in turn for $z_1$, for $z_2$, and for $w_1 z_1 + w_2 z_2$. Prove

$$\mathcal{E}\big[\hat{\beta}(z_1) - \mathcal{E}\hat{\beta}(z_1)\big]^2 > \mathcal{E}\big[\hat{\beta}(w_1 z_1 + w_2 z_2) - \mathcal{E}\hat{\beta}(w_1 z_1 + w_2 z_2)\big]^2 < \mathcal{E}\big[\hat{\beta}(z_2) - \mathcal{E}\hat{\beta}(z_2)\big]^2$$

7.D Prove or disprove the conjecture that the goodness of the weighted instrumental variable technique is insensitive to small departures of $w_1$, $w_2$ from their ideal relative sizes $\delta_1/(\delta_1 + \delta_2)$, $\delta_2/(\delta_1 + \delta_2)$.

CHAPTER 8
Limited information

8.1. Introduction

Limited information maximum likelihood is one of the many techniques available for estimating an identified (exactly or overidentified) equation. Other methods are (1) naive least squares, (2) least squares applied to the reduced form, (3) instrumental variables, (4) weighted instrumental variables, (5) Theil's method,¹ and (6) full information. Method 1 is biased and inconsistent; the rest are biased and consistent. They are listed in order of increasing efficiency. Limited information leads to burdensome computations, but is less cumbersome than full information. Unlike full information but like all other methods, limited information can be used on one equation of a model at a time. Limited information differs from the method of instrumental variables in two ways: it makes use of all, not an arbitrary selection, of the exogenous variables affecting the system; it prescribes a special way of using the exogenous variables. If an equation is exactly identified, limited information and instrumental variables are equivalent methods. Like all methods of estimating parameters, limited information uses formulas that are nothing more than a glorified version of the quotient

$$\frac{\text{change in effect(s)}}{\text{corresponding change in cause(s)}}$$

¹ Discussed in Chap. 9.

I shall illustrate by the example of (8-1), where the first equation is to be estimated by the limited information method. The rest of the model may be either latent or manifest.
The limited information method ignores part of what goes on in the remaining equations by deliberate choice, not because they are latent (though, of course, they might be). However, in (8-1) the entire model is spelled out for pedagogic reasons. The minus signs are contrary to the notational conventions used so far but are very handy in solving for $y_1$, $y_2$, $y_3$. Nothing in the logic of the situation is changed by expressing $B$ and $\Gamma$ in negative terms.

$$\begin{aligned} y_1 - \beta y_2 - \gamma z_1 &= u_1 \\ y_2 - \beta_{23}\,y_3 - \gamma_{22}\,z_2 - \gamma_{23}\,z_3 - \gamma_{24}\,z_4 &= u_2 \\ -\beta_{31}\,y_1 + y_3 - \gamma_{31}\,z_1 - \gamma_{34}\,z_4 &= u_3 \end{aligned} \tag{8-1}$$

As usual, a single asterisk distinguishes the variables admitted in the first equation and a double asterisk those excluded from it. Thus $y^* = \mathrm{vec}(y_1, y_2)$, $z^* = \mathrm{vec}(z_1)$, $y^{**} = \mathrm{vec}(y_3)$, $z^{**} = \mathrm{vec}(z_2, z_3, z_4)$.

We apply the limited information method in two cases:

1. When nothing more is known about the economy than that somehow $z_2$, $z_3$, $z_4$ affect it
2. When more is known but this is purposely ignored

8.2. The chain of causation

Let the first equation be the one we wish to estimate, out of a model containing several. The chains of causation in a general model of several equations in several unknowns are shown in Fig. 15. The arrowheads show that causation flows from the $z$'s to the $y$'s but not back, and mutually between the $y$'s. Solid arrows correspond to the first equation, broken arrows to the rest of the model. The two left-hand rounded arrows, one solid, one broken, show that the $y^*$'s (the endogenous variables admitted in the manifest model) interact both in the first equation and (possibly, too) in the rest of the model. The right-hand rounded arrow shows that the $y^{**}$'s (the endogenous variables excluded) interact, but, naturally, only in the rest of the model.

[Fig. 15. Causation in econometric models.]

[Fig. 16. Chain of causation in the special model (8-1).]

Parenthetically, the crinkly arrows symbolize intercorrelation among the exogenous variables.
Ideally, the exogenous variables are unconnected, but in any given sample they may happen to be intercorrelated. This is the familiar problem of multicollinearity (Sec. 6.14) in its general form. The stronger the correlation between one $z$ and another, the less reliable are estimates of the $\gamma$s because different exogenous variables have acted alike in the sample period. We shall ignore multicollinearity and continue with the main subject.

Figure 16 shows the chains of causation in (8-1). The variable $z_1$ affects $y_1$ and $y_2$ in the first equation and $y_1$ and $y_3$ in the third. Variables $z_2$, $z_3$, $z_4$ affect $y_2$ and $y_3$ in the second, and $z_4$ affects $y_1$ and $y_3$ in the third. We can make the arrows between the $y$'s single rather than double-headed because there are as many equations as there are endogenous variables. Thus the model can be put into cyclical form.¹ It so happens that (8-1) is already in this form; that is, given a constellation of values for the exogenous variables and the random disturbances, if we give $y_3$ an arbitrary value, then $y_3$ determines $y_2$, which in turn determines $y_1$, which in turn affects $y_3$, and so round and round until mutually compatible values are reached.

8.3. The rationale of limited information

The problem of estimating the model of Fig. 16 can be likened to the following problem. Suppose that $z_1$, $z_2$, $z_3$, $z_4$ are the locations of four springs of water and that $y_1$, $y_2$, $y_3$ are, respectively, the kitchen tap, bathtub tap, and shower tap of a given house. The arrows are pipes or presumed pipes. Estimating the first equation is like trying to find the width of the pipes between $z_1$ and $y_1$ and $y_2$ and of the pipe between $y_2$ and $y_1$. The width is estimated by varying the flows at the four springs $z_1$, $z_2$, $z_3$, $z_4$ and then measuring the resulting flow in the kitchen ($y_1$), bathtub ($y_2$), and shower ($y_3$).

Limited information attempts
to solve the same problem with the following handicaps arising either from lack of knowledge or from deliberate neglect of knowledge:

1. Pipes are known to exist for certain only where there are solid arrows ($\gamma$, $\beta$).
2. It is known that $z_2$, $z_3$, $z_4$ enter the flow somewhere or other, but it is not known where.
3. It is not known whether there is another direct pipeline ($\gamma_{31}$) from $z_1$ to the kitchen ($y_1$) and bathtub ($y_2$).
4. The flow at the shower ($y_3$) is ignored even if it is measurable.

¹ Note carefully that the cyclical and the recursive are different forms.

So as not to fill up page upon page with dull arithmetic, I am going to cut model (8-1) drastically by some special assumptions, which, I vouch, remove nothing essential from the problem. The special assumptions are: $\beta = 1.6$, $\gamma = 0.5$, $\gamma_{31} = \gamma_{34} = 0$, $\beta_{31} = 0.1$, $\beta_{23} = 2$; $\gamma_{22} z_2 + \gamma_{23} z_3 + \gamma_{24} z_4$ is combined into one simple term $= \gamma_2 z^{**}$; $\gamma_2 = 0.5$. Then (8-1) collapses to

$$\begin{aligned} y_1 - \beta y_2 - \gamma z^* &= u_1 \\ y_2 - \beta_{23}\,y_3 - \gamma_2 z^{**} &= u_2 \\ -\beta_{31}\,y_1 + y_3 &= u_3 \end{aligned} \tag{8-2}$$

and Fig. 16 collapses to Fig. 17.

[Fig. 17. Another special case.]

Now let us change metaphors. Instead of a hydraulic system, think of a telephone network. The coefficients $\beta$, $\gamma$, if greater than 1, represent loud-speakers; if less than 1, low-speakers. Where a coefficient is equal to 1, sound is transmitted exactly. To avoid having to reconcile conflicting observations, assume that all the disturbances are zero, i.e., that there is neither leakage of sound out of nor noise into the acoustic system of Fig. 17.

Here is how the estimating procedure works. Begin from a state of acoustical equilibrium, and measure the noise level at each point of the network. Then step up the sound level at $z^{**}$ by 100 units. Only 50 of these reach location $y_2$, because there is a twofold low-speaker ($\gamma_2 = 0.5$) between $z^{**}$ and $y_2$. Also step up the sound level at $z^*$ by, say, 10 units. Only 5 units ($\gamma = 0.5$) get to $y_1$. But, whatever extra
noise there is at $y_1$, one-tenth of it ($\beta_{31} = 0.1$) reaches $y_3$. From $y_3$ a loud-speaker doubles the increment as it conveys it to $y_2$, whence some gets to $y_1$, and so on. By differencing (8-2) and solving for $\Delta u = 0$, $\Delta z^* = 10$, and $\Delta z^{**} = 100$, the ultimate increments are found to be $\Delta y_1 = 125$, $\Delta y_2 = 75$, $\Delta y_3 = 12.5$.

Now, suppose we did not know how strong was the connection $\beta$ between $y_1$ and $y_2$. By differencing (8-2), we get

$$\beta = \frac{\Delta y_1 - \gamma\,\Delta z^*}{\Delta y_2} \tag{8-3}$$

When the model is exact, it takes exactly five observations to determine $\beta$, $\gamma$, $\gamma_2$, $\beta_{23}$, $\beta_{31}$. When the model is stochastic, there are complications, but the basic appearance of the formula is not much different.

The numerator can be interpreted as that change in the sound level $y_1$ not attributable to what is coming over the line from $z^*$, that is to say, only the sound that comes from $y_2$ and $y_3$. The denominator measures the increment at $y_2$ resulting from two sources, $z^{**}$ and $y_3$. The limited information method just ignores the latter source entirely. This is so because both $\beta_{23}$ and $\beta_{31}$ belong to the "rest of the model" and are neither specified nor evaluated. So, (8-3) is interpreted as follows:

$$\beta = \frac{\text{variation in } y_1 \text{ not due to any } z^*}{\text{variation in } y_2 \text{ from all sources}} \tag{8-4}$$

The limited information method suppresses $\beta_{23}$ and $\beta_{31}$ and estimates $\beta$ by

$$\hat{\beta} = \frac{\Delta y_1 - \gamma\,\Delta z^*}{\gamma_2\,\Delta z^{**}} = \frac{\text{variation in } y_1 \text{ not due to } z^*}{\text{variation in } y_2 \text{ due to } z^{**}} \tag{8-5}$$

Notice carefully that the method suppresses only $\beta_{23}$, $\beta_{31}$, that is, the latent model's intervariation of the endogenous variables. It does not suppress $\gamma_2$, i.e., the variation (in the latent model) due to the exogenous variables $z^{**}$.

8.4. Formulas for limited information

This section shows that the lengthy formulas for computing limited information estimates are just fancy versions of (8-3). It can safely be skipped, for it contains no new ideas.

To obtain estimates of the
To obtain estimates of the D 124 LIMITED INFORMATION #s of the first equation, combine the moments of the variables as in the following list: In general In model (8-1) 1. Construct C m m y ««(nO- 1 m 1! y. *n(vLV,Kn,$t .*..««) ' {n*(«i,«,.t,.i4) 2 )- 1 2. Construct D - m y vtm BV }- , m a v ™ ,,,,, • (m,,,,)" 1 • (m flVl m tlVt ) 3. Construct W SS **l y » y « ~ C 4. Compute 1 V = < 5. Compute 2 Q _ V- J W 6. The estimate of the /9s of the first equation Is a nontrivial solution (called the eigenvector) of Q. 7. Having computed (J, one can calculate f , u, and estimates of the covariances of the disturbances and the parameter estimates. In steps 1 and 2 above, the factors m z « z « and m za5 and also the z and z* in the remaining moment matrices play a role analogous to the weights cj/2c? in the least squares technique. 8 They just provide a method for reconciling the conflicting observations generated by the nonzero random disturbances. The matrix m y . y . corresponds to the pair of round arrows about y* in Fig. 15. Essentially, Q is an estimate of the 0s of the first equation. Q can be interpreted as a quotient, because the matrix operation V -1 W reminds one of the ratio of two numbers: W/V. Actually, this impressionistic intuition is quite correct. W corresponds to an elaborate case of Ae 1 Klein calls this B instead of V. I use V to avoid confusion with the B of By + Tz — u. 2 Klein calls this A. I use Q to avoid confusion with the A of the model Ax « u. 3 Compare with Sec. 7.2. 8.6. CONNECTION WITH INDIRECT LEAST SQUARES 125 (the change in the effects), and V is an elaborate case of Ac (the cor- responding change in the causes). Indeed W and V are complicated cases of the numerator and denominator of (8-5). W is interpreted as the variation of the endogenous variables not due to any exogenous changes, and V expresses the variation of the endogenous variables from all sources exogenous to any equation of the model and endoge- nous to the manifest part. 8.5. 
Connection with the instrumental variable method

Limited information recognizes that exogenous influences not present in the first equation influence the course of events. The instrumental variable method acknowledges the same thing. Limited information makes use of all these exogenous influences, whereas the instrumental variable method (generally) picks from among them, either haphazardly or according to the principles of Sec. 7.7. When the first equation is exactly identified, picking is impossible and the two methods coincide.

8.6. Connection with indirect least squares

The limited information method can also be interpreted as a form of modified indirect least squares or as a generalization of directional least squares (see the Digression in Sec. 4.4). The direct or naive least squares method estimates $\beta$ essentially as the regression coefficient of $y_1$ on $y_2$. Haavelmo's proposition (Chap. 4) advised us to minimize square residuals in the northeast-southwest direction in order to allow for autonomous variations in the exogenous variable, investment $z_t$. In (8-1) there are several such exogenous variables $z_1$, $z_2$, $z_3$, $z_4$, which generate in the $y_1y_2$ plane a scatter diagram which is a weighted average of lozenge-shaped figures (as in Fig. 9), one for $z_1$, one for $z_2$, and so on. In matrix $C$ (and, hence, in $W$ and $V$) this weighted averaging has taken place.

Further readings

Hood, chap. 10, describes in detail how to compute limited information and other types of estimates, and illustrates with a completely worked out macroeconomic model of Klein's.

CHAPTER 9
The family of simultaneous estimating techniques

9.1. Introduction

We owe to Theil¹ a theorem showing that all the estimating techniques of Chaps. 4 to 8 are special cases of a new technique, which has the further merit of being fairly easy to compute. Section 9.2, which covers this ground, is addressed primarily to lovers of mathematical generality and elegance; other readers might skip or skim.
The other sections of this chapter reconsider underidentification and overidentification from the point of view of research strategy. Section 9.3 accepts models as given (over-, under-, or exactly identified) and suggests alternative treatments. Section 9.4 raises the issue of whether econometric models can be anything but underidentified.

¹ Reference in Further Readings at the end of this chapter.

9.2. Theil's method of dual reduced forms

This method can be applied to all equations of a system, one at a time. The equation we want to estimate, called the "first" equation, comes from a complete system, for instance, (8-1). We know and can observe all the exogenous variables affecting the system, and we also know a priori which variables (endogenous and exogenous) enter the first equation. The other equations may be identified or not. The disturbances have the usual Simplifying Properties.

Any endogenous variable of the first equation can be chosen to play the role of dependent variable. We shall use $y_1$ in this role. The remaining variables of the first equation, namely, $y_2, \ldots, y_{G^*}$; $z_1, \ldots, z_{H^*}$, must all be different in the sample; that is to say, they must not behave as if they were linear combinations of one another. We do not need to know or observe the endogenous variables $y_{G^*+1}, \ldots, y_G$ not present in the first equation.

Let one star, as usual, represent presence in the first equation, and two stars, absence from the first equation. We then form two reduced forms whose coefficients we calculate by simple least squares: (1) $y^*$ on $z^*$ with parameters $\breve{\pi}$ and residuals $\breve{v}$; and (2) $y^*$ on $z = (z^*, z^{**})$ with parameters $p$ and residuals $w$.
For instance, to estimate the first equation of (8-1), compute

$$\begin{aligned}
y_1 &= \breve{\pi}_{11} z_1 + \breve{v}_1 &\qquad y_1 &= p_{11} z_1 + p_{12} z_2 + p_{13} z_3 + p_{14} z_4 + w_1 \\
y_2 &= \breve{\pi}_{21} z_1 + \breve{v}_2 &\qquad y_2 &= p_{21} z_1 + p_{22} z_2 + p_{23} z_3 + p_{24} z_4 + w_2
\end{aligned} \qquad (9\text{-}1)$$

The right-hand set in (9-1) is necessary for estimating the first and useful for estimating the other equations of (8-1). Let us omit the bird ($\breve{\ }$) where it is obvious. Next, we compute the moments of the residuals on one another and construct two new matrices $D(k)$ and $N(k)$:

$$D(k) = m_{(y_2,\ldots,y_{G^*};\,z_1,\ldots,z_{H^*})(y_2,\ldots,y_{G^*};\,z_1,\ldots,z_{H^*})} - k\,m_{(w_2,\ldots,w_{G^*};\,0,\ldots,0)(w_2,\ldots,w_{G^*};\,0,\ldots,0)}$$

$$N(k) = m_{(y_2,\ldots,y_{G^*};\,z_1,\ldots,z_{H^*})\,y_1} - k\,m_{(w_2,\ldots,w_{G^*};\,0,\ldots,0)\,w_1}$$

where $k$ is a variable that will be defined below. Then the estimates of the $\beta$s and $\gamma$s of the first equation are given by

$$\operatorname{est}(\beta_2, \ldots, \beta_{G^*}, \gamma_1, \ldots, \gamma_{H^*}) = [D(k)]^{-1} N(k) \qquad (9\text{-}2)$$

Theil has proved that, if $k = 0$, then (9-2) gives the naive least squares estimate with $y_1$ treated as the sole dependent variable. If $k = 1$, (9-2) gives the method of unweighted instrumental variables of Sec. 7.7. If $k = 1 + \nu$, where $\nu$ is the smallest root of

$$\det\bigl[m_{(v_1,\ldots,v_{G^*})(v_1,\ldots,v_{G^*})} - (1 + \nu)\,m_{(w_1,\ldots,w_{G^*})(w_1,\ldots,w_{G^*})}\bigr] = 0 \qquad (9\text{-}3)$$

then the estimates of (9-2) are identical with the limited information estimates of Chap. 8. All these estimates except for the $k = 0$ case are consistent, but biased in the $\beta$s. In the case $k = 1$, the bias itself can be estimated and corrected for.¹

These findings not only are exciting for their beauty and symmetry, but are practical as well. The regressions (9-1) are straightforward and attainable by simple calculation (see Appendix B) even for large systems. The solution of (9-3) is not too hard, since the number $G^*$ of present endogenous variables seldom exceeds 3 or 4 in any actual models. But (9-3) must be calculated over again if we decide to estimate the second or third equation of the original model.
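The recipe of (9-1) to (9-3) can be sketched numerically. The block below is a minimal illustration, not Theil's or the author's own computation: the function names (`residuals`, `k_class`, `liml_k`), the simulated data, and the sign convention (the first equation written as $y_1 = $ right-hand variables times coefficients plus $u$) are all assumptions made here. Moments are taken as plain cross-products, which presumes variables measured from their sample means. The case $k = 1$, which the text identifies with unweighted instrumental variables, is compared with what is commonly called two-stage least squares.

```python
import numpy as np

def residuals(Y, Z):
    """Least squares residuals of each column of Y regressed on the columns of Z."""
    coef = np.linalg.lstsq(Z, Y, rcond=None)[0]
    return Y - Z @ coef

def k_class(y1, Y2, Z1, Z_all, k):
    """Solve D(k) est = N(k) for the coefficients of (Y2, Z1) in the first equation.

    y1    : dependent endogenous variable, shape (n,)
    Y2    : other endogenous variables present in the equation, shape (n, G*-1)
    Z1    : exogenous variables present in the equation, shape (n, H*)
    Z_all : all exogenous variables of the system (must contain Z1's columns)
    """
    W = residuals(np.column_stack([y1, Y2]), Z_all)   # residuals w of the full reduced form
    w1, W2 = W[:, 0], W[:, 1:]
    X = np.column_stack([Y2, Z1])
    # (w_2, ..., w_G*; 0, ..., 0): pad the residuals with zeros in the z positions
    W_pad = np.column_stack([W2, np.zeros((len(y1), Z1.shape[1]))])
    D = X.T @ X - k * (W_pad.T @ W_pad)
    N = X.T @ y1 - k * (W_pad.T @ w1)
    return np.linalg.solve(D, N)

def liml_k(y1, Y2, Z1, Z_all):
    """Return k = 1 + nu, the smallest root of det[m_vv - k m_ww] = 0 as in (9-3)."""
    Y_star = np.column_stack([y1, Y2])
    V = residuals(Y_star, Z1)      # residuals of y* on z*
    W = residuals(Y_star, Z_all)   # residuals of y* on z = (z*, z**)
    roots = np.linalg.eigvals(np.linalg.solve(W.T @ W, V.T @ V))
    return float(np.min(roots.real))

# Hypothetical data: two endogenous variables, four exogenous ones, z1 present in the equation.
rng = np.random.default_rng(0)
n = 200
Z = rng.normal(size=(n, 4))
u = rng.normal(size=(n, 2))
y2 = Z @ np.array([1.0, 0.5, -0.7, 0.3]) + u[:, 1]
y1 = 0.8 * y2 + 1.5 * Z[:, 0] + u[:, 0] + 0.5 * u[:, 1]
Y2, Z1 = y2[:, None], Z[:, :1]

est_naive = k_class(y1, Y2, Z1, Z, 0.0)            # k = 0
est_k1 = k_class(y1, Y2, Z1, Z, 1.0)               # k = 1
k_liml = liml_k(y1, Y2, Z1, Z)                     # k = 1 + nu
est_liml = k_class(y1, Y2, Z1, Z, k_liml)
print(est_naive, est_k1, k_liml, est_liml)
```

Because only the scalar `k` changes between the three calls, the regressions of (9-1) are computed once per equation and reused, which is the practical point made in the text.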
Theil states that his technique works if the remaining equations of the system are nonlinear and that it works for large samples even when some of the $z$'s are lagged values of some $y$.

¹ Compare with Appendix B.

9.3. Treatment of models that are not exactly identified

This section gives advice on how to treat models that in their natural state contain some underidentified or some overidentified equations, or both. The alternatives are listed from the most desirable to the least desirable, disregarding the cost of computation.

If a model contains some underidentified equations, we need do nothing about them unless we wish to estimate them. The remaining equations, if identified, can be estimated in any case. If we wish to estimate the underidentified equation, we must make certain alterations:

1. Make it identified by bringing in parameter estimates from independent sources, say, cross-section data. There are pitfalls of a new kind in this method, however, which are noted briefly in Chap. 12.
2. Identify the equation in question by strategically adding variables elsewhere in the model. This process, however, might de-identify the rest of the model.
3. Go ahead and estimate the underidentified equation; then, if you have a priori information on covariances, perform the tests of Sec. 6.9 to detect (or try to detect) whether you have estimated a bogus function.

If, on the other hand, the model contains some overidentified equations:

1. Use the full information, maximum likelihood method. This will yield consistent and efficient estimates of the identifiable parameters.
2. Use the limited information, maximum likelihood method.
3. Use instrumental variables, weighted.
4. Use instrumental variables, unweighted.
5. In the given equations, add variables where they are most relevant in such a way as to remove the overidentification.
6. Enlarge the system by endogenizing a previously exogenous variable.
7.
In the original overidentified model, remove the overidentification by introducing redundant variables in the other equations. If it turns out that the redundant variable has a significant parameter, you have succeeded.
8. Drop variables to remove the overidentification. Instead of outright dropping, you may linearly combine two or more such variables. This cannot always be done, because the combined variables are not always present together or absent together elsewhere in the model.
9. Use the reduced form, and select arbitrarily one of the several sets of alternative estimates.

Underidentification is a more serious handicap than overidentification. To remove the former you have to make material alterations in the model. To remove the latter you can always use the full information method.

Whatever the final alterations, I would begin by constructing my models without worrying about identification. In doing so, I am sure that I am acting in the light of my best a priori wisdom, given the objectives of my study and my computing budget. If it turns out that identification makes alterations necessary, I think that honesty requires me to keep a record of the identifying alterations. Like Ariadne's thread, this record keeps track of my search for a second best; I may want to give up in frustration and return to try another way out of the Minotaur's chamber.

9.4. The "natural state" of an econometric model

Econometricians have devoted a good deal of attention to overidentified models. This entire book, from Chap. 6 on, is devoted to developing various approximations¹ to the full information method, which everybody tries to avoid because of its burdensome arithmetic.
According to Liu,² we have been wasting our effort, because all well-conceived econometric models are in truth necessarily underidentified:

In economic reality, there are so many variables which have an important influence on the dependent variable in any structural equation that all structural relationships are likely to be "underidentified."

So Liu would not use any of our elaborate techniques, but would estimate just the reduced form and do so by simple least squares. The reduced form is to include as many exogenous variables as our knowledge and computational patience permit. Liu would then use these estimates for forecasting, and claims that they forecast better than all other techniques.

These subversive ideas deserve careful consideration. Is it true that structural equations in their natural, unemasculated, noble-savage state are underidentified? If they are, in what sense are forecasts from the reduced form better?

To begin with, there are occasions in which the investigator does not care to know the values of the structural parameters and is content with some kind of reduced form. To illustrate one occasion of this sort, assume that the investigator

1. Works from a typical and large enough sample
2. Forecasts for an economy of fixed structure
3. Forecasts from exogenous variables that stay in their sample ranges

Under the above conditions, an investigator would be glad to work with a ready-made reduced form though not necessarily with parameters estimated by simple least squares. He would accept the latter if justifiable, not for want of anything better.

¹ Unweighted and weighted instrumental variables and limited information.
² Ta-Chung Liu, "A Simple Forecasting Model for the U.S. Economy," p. 437 (International Monetary Fund Staff Papers, pp. 434-466, August, 1955).

Are econometric models necessarily underidentified?
Admittedly, it is an oversimplification, as Liu states,¹ to impose the condition that certain variables be absent from a given structural equation. But it is gross "overcomplification" (to coin a much-needed word) to impose no condition at all, inviting into the demand for left-handed, square-headed ½-inch bolts (and on equal a priori standing with the price of steel) the average diameter of tallow candles and the failure or success of the cod catch off the banks of Newfoundland. My instinct advises me to go halfway concerning these new variables: neither leave them out altogether nor admit them as equals. Consider the model

$$\begin{aligned} q + \alpha p + \gamma r + \delta f &= u \\ \beta q + p &= v \end{aligned} \qquad (9\text{-}4)$$

consisting of one underidentified and one overidentified equation. Now, if $r$ and $f$ are admitted as equals in the second equation, with parameters of their own, the whole system becomes underidentified. But the very knowledge that first convinced us to leave them out of the second equation now advises us to tack them on with a priori small parameters, small relative to $\beta$, $\delta$, etc. A reasonable restatement might be the following:

$$\begin{aligned} q + \alpha p + \gamma r + \delta f &= u \\ \beta q + p + j\beta r + k\delta f &= v \end{aligned} \qquad (9\text{-}5)$$

where $j$, $k$ are small constants, say 1/1000, 1/100, or some other not unreasonable value. And now (wonder of wonders!) both equations have become identified.

The trick does not always work. For instance, it does not help in

$$\begin{aligned} q + \alpha p + \gamma r &= u \\ \beta q + p &= v \end{aligned} \qquad (9\text{-}6)$$

to fill the hole with $k\alpha r$, nor $k\beta r$, nor $k\gamma r$, because we still have three parameters ($\alpha$, $\beta$, $\gamma$) to estimate and the reduced form contains only two coefficients, $\pi_1 = m_{qr}/m_{rr}$ and $\pi_2 = m_{pr}/m_{rr}$. However, if the supply of exogenous variables is less niggardly than in (9-6), it is not hard to find reasonable ways to complete a model so as to identify it in its entirety, if we so desire.

¹ Ibid., p. 405.

The most difficult and dangerous step is the assigning of values to $j$ and $k$.
The values must have the correct algebraic sign; otherwise, structural parameters are wildly misestimated. If the correct magnitudes for $j$ and $k$ are unknown, it is better to err on the small side than on the large. Too small (positive or negative) a value of $j$ is better than a hole in the equation, but too large a value may be worse than a hole.

9.5. What are good forecasts?

If we want to forecast from an underidentified model, we have no choice but to use some kind of reduced form; from an overidentified model, it is convenient, not compulsory, to work from a reduced form. The entire question in both cases is: What sort of reduced form? How ought we to compute its coefficients?

To pin down our ideas, we shall consider the model $By + \Gamma z = u$, where $u$ has all the Simplifying Properties; in addition we shall make the covariances $\sigma_{u_g u_h}$ known fixed constants, possibly all equal, so as to keep them out of the way of the likelihood function. This way we concentrate attention on the structural parameters $\beta$, $\gamma$, and $\pi$ and their rival estimates. The reduced form is $y = \Pi z + v$, where $\Pi = -B^{-1}\Gamma$ and $v = B^{-1}u$. The reduced form contains the entire set of exogenous variables whether the original form is exactly, over-, or underidentified.

Maximum likelihood minimizes $\sum\sum u^2$ by the $\beta$s and $\gamma$s; limited information and instrumental variables approximate this. The naive reduced form advocated by Liu minimizes $\sum\sum v^2$ by the $\pi$s (whatever these may be). Naturally, the two procedures are not equivalent, and, naturally, the second guarantees that residuals will be forecast with minimum variance.¹ But what is so good about forecasting residuals with minimum variance? The forecasts themselves in both cases are (in general) biased, but the forecasts by maximum likelihood have the greater probability of being right.

¹ Provided the sample and structure conform to conditions 1 to 3 of Sec. 9.4.

In Fig.
18, $p$ is the course of future events if no disturbances occur. One curve shows the (biased) probability distribution of the full information, maximum likelihood estimate of $p$; it is in general biased (its expected value differs from $p$) but has its peak at $p$ itself. A second curve belongs to another maximum likelihood estimate (say, instrumental variables or limited information); it too has a peak at $p$ but a lower one, perhaps a different bias, and certainly a larger variance than the full information estimate. The reduced-form least squares estimate is distributed as in the third curve; naturally it has a smaller spread than the other two; it may be more or less biased than either; but its peak is off $p$.

Fig. 18. The properties of forecasts. Horizontal axis: the forecast variable and its forecasts. Shown are the true value of the forecast variable under zero disturbances ($p$) and the distributions of the reduced-form least squares estimates, the full-information maximum likelihood estimates, and other maximum likelihood estimates.

To put this into words: If, in the postsample year, all disturbances happen to be zero, maximum likelihood estimates forecast perfectly, and least squares forecast imperfectly. If the disturbances are nonzero, both forecast imperfectly; but, on the average and in the long run, least squares forecasts are less dispersed around their (biased) mean.

Which criterion is more reasonable is, I think, open to debate. I favor maximum likelihood estimates for much the same reason that I accept the maximum likelihood criterion in the first place: If we are to predict the future course of events, why not predict that the most probable thing ($u = 0$) will happen? What else can we sanely assume, the second most probable? On the other hand, if my job depends on the average success of my forecasts, I shall choose the least biased technique and disregard the highest probability of particular instances.
If I want to make a showing of unswerving, unvacillating steadfastness, I shall use the least squares technique on the reduced form, even though it steadfastly throws my forecasts off the mark in each particular instance and in the totality of instances.

Further readings

The reference for Sec. 9.2 is H. Theil, "Estimation of Parameters of Econometric Models" (Bulletin de l'Institut international de statistique, vol. 34, pt. 2, pp. 122-120, 1954). It is full of misprints.

Extraneous estimators are illustrated in Klein, chap. 5, where he pools time-series and cross-section data. Their statistical and common-sense difficulties are discussed in Edwin Kuh and John R. Meyer, "How Extraneous Are Extraneous Estimates?" (Review of Economics and Statistics, vol. 39, no. 4, pp. 380-393, November, 1957).

Tinbergen, pp. 200-204, discusses the advantages and disadvantages of working from a reduced form, but overlooks that its least squares estimation is maximum likelihood only for an underidentified or exactly identified system.

Ever since Haavelmo, Koopmans, and others proposed elaborate methods for correct simultaneous estimation, naive and not-so-naive least squares has not lacked ardent defenders. Carl F. Christ, "Aggregate Econometric Models" [American Economic Review, vol. 46, no. 3, pp. 385-408 (especially pp. 397-401), June, 1956], claims that least squares forecasts are likely to be more clustered than other forecasts; and Karl A. Fox, "Econometric Models of the U.S. Economy" (Journal of Political Economy, vol. 64, no. 2, pp. 128-142, April, 1956), has performed simple least squares regressions using the data and form of the Klein-Goldberger model (for reference, see Further Readings, chap. 1). See also Carl F. Christ, "A Test of an Econometric Model of the United States 1921-1947" (Universities-National Bureau Committee, Conference on Business Cycles, New York, pp. 35-107, 1951), with comments by Milton Friedman, Lawrence R. Klein, Geoffrey H.
Moore, and Jan Tinbergen and a reply by Christ, pp. 107-129. In pp. 45-50 Christ summarizes the properties of rival estimating procedures. E. G. Bennion, in "The Cowles Commission's 'Simultaneous Equations Approach': A Simplified Explanation" (Review of Economics and Statistics, vol. 34, no. 1, pp. 49-56, 1952), illustrates why least squares gives a better historical relationship and better forecasts (as long as exogenous variables stay in their historical range) than do simultaneous estimates. John R. Meyer and Henry Laurence Miller, Jr., "Some Comments on the 'Simultaneous-equation Approach'" (Review of Economics and Statistics, vol. 36, no. 1, February, 1954), state very clearly the different kinds of situations in which forecasts have to be made, and to each corresponds a proper estimating procedure.

Herman Wold says that he wrote Demand Analysis (New York: John Wiley & Sons, Inc., 1953) in large part to reinstate "a good many methods which have sometimes been declared obsolete, like the least squares regression or the short-cut of consumer units in the analysis of family budget data" and to "reveal and take advantage of the wealth of experience and common sense that is embodied in the familiar procedures of the traditional methods" (from page x of the preface). He believes that the economy is in truth recursive and that it can be described by recursive models whose equations, in the proper sequence, can be estimated by least squares. His second chapter, entitled "Least Squares under Debate" (especially secs. 7 to 9), is very far from convincing me that he is right.

CHAPTER 10
Searching for hypotheses and testing them

10.1. Introduction

Crudely stated, the subject of this chapter is how to tell whether some variables of a given set vary together or not and which ones do so more than others.
The problem is how to make three interrelated choices: (1) a choice among the variables available, (2) a choice among the different ways they can vary together, and (3) a choice among different criteria for measuring the togetherness of their variation. The whole thing is like a complicated referendum for simultaneously (1) choosing the number and identity of the delegates, (2) deciding whether they should sit in a unicameral or multicameral legislature, and (3) supplying them with rules of procedure to use when they go into session. This triple task is too much for a statistician, as it is for a citizenry: it wastes statistical data, as it wastes voters' time and attention.

Just as, in practice, people settle independently, arbitrarily, and at a prior stage the number of chambers, the number of delegates, and the rules of procedure, so the statistician uses maintained hypotheses. For example, in the model $C_t = \alpha + \gamma Z_t + u_t$ of Chap. 1, the presence of one and not two equations, two and not four variables, all the remaining stochastic and structural assumptions, and the requirement for maximizing likelihood are the maintained hypotheses. Only rival hypotheses about the true parameter values $\alpha$ and $\gamma$ remain to be tested.

The entire field of hypothesis searching and testing consists of variations on the above theme. The maintained hypotheses can be made more or less liberal, or they may change roles with the questioned hypotheses. Section 10.4 lists many specific examples.

The general moral of this chapter is this: Having used your data to accept or reject a hypothesis while maintaining others, you are not supposed to turn around, maintain your first decision, and test another hypothesis with the same data. If you are interested in testing two hypotheses from the same set of data, you must test them together.
Thus, if you want to find both the form and personnel of government preferred by the French, you should ask them to rank on the ballot all combinations (like Gaillard/unicameral, Gaillard/bicameral, Pinay/unicameral, Pinay/bicameral) and to decide simultaneously who is to lead and which type of parliament; not the man first and the type second; not the type first and the man second.

Everything that follows in this chapter pretends that variables are measured without error. Sections 10.2 and 10.3 introduce two new concepts: discontinuous hypotheses and the null hypothesis. Sections 10.4 to 10.8 explore some of the commonest hypotheses considered by econometricians, especially when they set about to specify a model.

10.2. Discontinuous hypotheses

Consider again the simple model $C_t = \alpha + \gamma Z_t + u_t$. The rival hypotheses here are alternative values of $\alpha$ and $\gamma$ and may be any pair of real numbers. This is an example of continuity.

Now consider this problem: Does $x$ depend on $y$, or the other way around? Taking the dependence (for simplicity only) to be linear and homogeneous, the rival hypotheses here are

$$x_t = \gamma y_t + u_t \quad \text{versus} \quad y_t = \delta x_t + v_t$$

The answer is yes or no; either the first or the second equation holds. This is an example of discontinuity. However, the further problem of the size of $\gamma$ (or $\delta$), by itself, may be a continuous hypothesis problem.

Many of my examples below (Sec. 10.4) are discontinuous. The simple maximizing rules of the calculus do not work when there is discontinuity, and this fact makes it very interesting.

10.3. The null hypothesis

In selecting among hypotheses we can proceed in two ways: (1) compare them all to one another; (2) compare them each to a special, simple one, called the null hypothesis (symbolized by $H_0$).
An example of the first procedure is the maximum likelihood estimation of $(\alpha, \gamma)$ in the model $C_t = \alpha + \gamma Z_t + u_t$, since it compares all conceivable pairs $(\alpha, \gamma)$ in choosing the most likely among them.

The other way to proceed is somewhat as follows: select a null hypothesis, for example, $\alpha = 3$ and $\gamma = 0.7$, and accept or reject it (i.e., accept the proposition "either $\alpha \neq 3$ or $\gamma \neq 0.7$, or both") from evidence in the sample.

I have more to say later on about how to select a null hypothesis and what criteria to use for accepting or rejecting it. Meanwhile, note that the decision to proceed via null hypothesis has nothing to do with continuity and discontinuity, though it happens that many applications of the null hypothesis technique are in discontinuous problems.

10.4. Examples of rival hypotheses

Many of the examples in this section are linear and homogeneous for the sake of simplicity only; in these cases linearity (and homogeneity) is guaranteed not to affect the principle discussed. In other examples, however, linearity (or homogeneity) is a rival hypothesis and thus very much involved in the principle discussed. Now to the examples:

1. Which one variable from a given set of explanatory variables is best? For instance, should we put income, past income, past consumption, or age in a rudimentary consumption function? The rival hypotheses here are

$$C_t = \beta Y_t + u_t \qquad C_t = \gamma Y_{t-1} + u_t \qquad C_t = \delta C_{t-1} + u_t \qquad \text{etc.}$$

2. Should the single term be linear or quadratic, logarithmic, etc.? The rival hypotheses here are

$$C_t = \beta Y_t + u_t \qquad C_t = \gamma Y_t^2 + u_t \qquad C_t = \delta \log Y_t + u_t \qquad \text{etc.}$$

Note that this becomes a special case of example 1 if we agree that $Y^2$, $\log Y$, etc., are different variables from $Y$ (Sec. 10.9).

3. What value of the single parameter is best? In $C_t = \beta Y_t + u_t$ the rival hypotheses are different values of $\beta$, say, $\beta = 1$, $\beta = \tfrac{1}{2}$, $\beta = \tfrac{3}{4}$, and others.
This, too, is a special case of example 1, because it can be expressed as a choice among the explanatory variables $Y$, $2Y$, $4Y/3$, respectively.

4. Should there be one or more equations in the model? This question, important when several variables are involved, lurks behind the problems of confluence (see Sec. 6.14), but it arises even with two variables.

The above examples generalize, naturally. For instance, the question may be which two or which three variables to include, which linearly, which nonlinearly, how many lags, and how far back.

5. Which variables are to be regressed on which? The rival hypotheses are

$$x_1 = \alpha x_2 + u \quad \text{versus} \quad x_2 = \beta x_1 + v$$

for two variables. If we maintain the hypothesis of three variables in a single equation, the rival hypotheses are

$$x_1 = \alpha x_2 + \beta x_3 + u \quad \text{versus} \quad x_2 = \gamma x_1 + \delta x_3 + v \quad \text{versus} \quad x_3 = \epsilon x_1 + \zeta x_2 + w$$

And, if we maintain three variables and two equations, the rival hypotheses become

$$\begin{aligned} x_1 &= \alpha x_2 + \beta x_3 + u \\ x_2 &= \gamma x_1 + \delta x_3 + v \end{aligned} \quad \text{versus} \quad \begin{aligned} x_1 &= \epsilon x_2 + \zeta x_3 + w \\ x_3 &= \eta x_1 + \theta x_2 + t \end{aligned} \quad \text{versus} \quad \begin{aligned} x_2 &= \kappa x_1 + \lambda x_3 + l \\ x_3 &= \mu x_1 + \nu x_2 + r \end{aligned}$$

and so on for more equations and more variables. This is typically a discontinuous problem. It is discussed briefly in Sec. 10.8.

6. Having decided that $x_1$ is an explanatory variable, does it help to include $x_2$ as well? The rival hypotheses are

$$y = \alpha x_1 + v \quad \text{versus} \quad y = \beta x_1 + \gamma x_2 + w$$

Section 10.8 contains hints on this problem.

7. Having decided to include $x_1$, which one other variable should be added?

$$y = \alpha x_1 + \beta x_2 + u \quad \text{versus} \quad y = \gamma x_1 + \delta x_3 + v \quad \text{etc.}$$

Section 10.8 applies to this problem.

8. Is it better to have a ratio model or an additive one?

$$\frac{c}{n} = \alpha\,\frac{y}{n} + u \quad \text{versus} \quad c = \beta y + \gamma n + v$$

This is discussed in Sec. 10.10.

9. Is it better to have a separate equation for each economic sector or the same equation to which is added a variable characterizing the sector?
For example, consider the following rival demand models:

$$\begin{aligned} q &= \alpha p + u \quad \text{for the poor} \\ q &= \beta p + v \quad \text{for the rich} \end{aligned} \quad \text{versus} \quad q = \gamma p + \delta y + w$$

where $y$ is income. Section 10.11 discusses this problem.

10. (A special case of the above.) Are dummy variables better than separate equations?

$$\begin{aligned} q &= \alpha p + u \quad \text{in wartime} \\ q &= \beta p + v \quad \text{in peacetime} \end{aligned} \quad \text{versus} \quad q = \gamma p + \delta Q + w, \quad Q = 0 \text{ in peacetime}, \; Q = 1 \text{ in wartime}$$

This problem is a special case of the example discussed in Sec. 10.11.

11. Do variables interact? That is to say, does the size of one or more variables fortify (or nullify) the others' separate effects? For instance, if being stupid and being old (the variables $s$ and $a$, respectively) are bad for earning income, are stupidity and old age in combination worse than the sum of their separate effects? The rival hypotheses are

$$y = \alpha s + \beta a + w \quad \text{versus} \quad y = \gamma s + \delta a + \epsilon s a + v$$

and can also be expressed as follows:

$$y = \gamma s + \delta a + \epsilon s a + v \qquad \text{Null hypothesis: } \epsilon = 0$$

or as follows:

$$\begin{aligned} y &= \alpha s + \beta + w \quad \text{for the young} \\ y &= \gamma s + \delta + v \quad \text{for the old} \end{aligned} \qquad \text{Null hypothesis: } \frac{\alpha}{\gamma} = 1$$

This case is not spelled out, but the discussion of Sec. 10.6 applies to it.

This list is not exhaustive. And, naturally, the above questions can be combined into complex hypotheses.

Digression on correlation and kindred concepts

This is a good place to gather together some definitions and theorems and to issue some simple but often unheeded warnings. It is also an excellent opportunity to learn, by doing the exercises, to manipulate correlations and regression coefficients as well as all sorts of moments.

Universe and sample. Keep in mind that Greek letters refer to properties of the universe and that Latin letters are used to refer to the corresponding sample properties. Thus, as already explained in the Digression of Sec. 1.12, $\sigma_{xx}$, $\sigma_{xy}$, $\sigma_{yy}$ are population variances and covariances of $x$ and $y$.
The corresponding sample quantities¹ are $m_{xx}$, $m_{xy}$, $m_{yy}$, the so-called "moments from the sample means," introduced in the same Digression,

$$m_{xy} = \sum_s (x_s - x^\circ)(y_s - y^\circ)$$

where $s$ runs over the sample $S^\circ$. The universe coefficient of correlation $\rho$ is defined by

$$\rho_{xy} = \frac{\sigma_{xy}}{\sqrt{\sigma_{xx}\,\sigma_{yy}}}$$

and the corresponding sample coefficient $r$ by

$$r_{xy} = \frac{m_{xy}}{\sqrt{m_{xx}\,m_{yy}}}$$

¹ To the population covariances $\sigma_{xy}$ there correspond two types of sample quantities: those measured from the mean of the universe, $q_{xy} = \sum_s (x_s - \mathcal{E}x)(y_s - \mathcal{E}y)$, where $s$ runs over the sample $S^\circ$; and those measured from the mean of the sample, namely $m_{xy}$. Interchanging $q_{xy}$ and $m_{xy}$ does not hurt at all, in general, when the underlying model is linear, since $m_{xy}$ is an unbiased, consistent, etc., estimator of both $q_{xy}$ and $\sigma_{xy}$, etc. There are difficulties in the case of nonlinear models, but we shall not go into them here.

Later on we define partial, multiple, etc., coefficients of correlation. In all cases, a coefficient of correlation measures the togetherness of two and only two variables, though one or both may be compounded of several others. This elementary fact is often forgotten.

For the sake of symmetry in notation, when handling several variables, we shall use $x$ with subscripts: $x_1$, $x_2$, $x_3$, etc. Then we write simply $\rho_{ij}$, $r_{ij}$, $m_{ij}$ for $\rho_{(x_i)(x_j)}$, $r_{(x_i)(x_j)}$, $m_{(x_i)(x_j)}$, etc.

Both $\rho_{xy}$ and $r_{xy}$ range from $-1$ to $+1$. Values very near $\pm 1$ mean that $x$ and $y$ have a tight linear fit like $\alpha x + \beta y = w$, with the residuals very small. A tight nonlinear fit like $x^2 + y^2 = 1$ does not yield a large coefficient of correlation $\rho_{xy}$. What we need to describe this fit is $\rho_{(x^2)(y^2)}$. And similarly for relations like $\alpha y + \beta \log x = w$ or $\alpha y^2 + \beta x^3 = w$, we need $\rho_{(\log x)(y)}$ and $\rho_{(x^3)(y^2)}$, respectively.

10.5. Linear confluence

From now on until the contrary is stated, I shall deal with linear relations exclusively.
The discussion is perfectly general for any finite number of variables, but three are enough to capture the essence of the problems with which we shall be dealing. Let the three variables be

$X_1$ number of pints of liquor sold at a ski resort in a day
$X_2$ number of tourists present in the resort area
$X_3$ average daily temperature

We suppose there are one or several linear stochastic relationships among some or all of these variables. The least-squares-regression coefficients are denoted by $a$'s and $b$'s, with a standard system of subscripts. Begin with regressions among $X_1$, $X_2$, $X_3$, taken two at a time; there are six such regressions. In (10-1) below these regressions are arranged in rows according to the variable that is treated as if it were dependent and in columns according to the variable treated as independent.

$$\begin{aligned} X_1 &= a_{1.2} + b_{12}X_2 & X_1 &= a_{1.3} + b_{13}X_3 \\ X_2 &= a_{2.1} + b_{21}X_1 & X_2 &= a_{2.3} + b_{23}X_3 \\ X_3 &= a_{3.1} + b_{31}X_1 & X_3 &= a_{3.2} + b_{32}X_2 \end{aligned} \tag{10-1}$$

In each subscript, the very first digit denotes the dependent variable. If there is a second digit before any dot appears, it denotes the independent variable to which the coefficient belongs. Digits after the dot (if any) represent the other independent variables (if any) present elsewhere in the equation. The order of digits before the dot is material, because it tells which variable is regressed on which. The order of subscripts after the dot is immaterial, because these digits merely record the other "independent" variables.

The same three variables can be regressed three at a time. There are three such regressions:

$$\begin{aligned} X_1 &= a_{1.23} + b_{12.3}X_2 + b_{13.2}X_3 \\ X_2 &= a_{2.13} + b_{21.3}X_1 + b_{23.1}X_3 \\ X_3 &= a_{3.12} + b_{31.2}X_1 + b_{32.1}X_2 \end{aligned} \tag{10-2}$$

As an exercise, consider the four-variable regression

$$X_1 = a_{1.234} + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4$$

and fill in the missing subscripts in

$$X_3 = a_{\_.\_\_\_} + b_{\_\_.\_\_}X_1 + b_{\_\_.\_\_}X_2 + b_{\_\_.\_\_}X_4$$
X4 Returning to our liquor example, suppose we decide to measure the three variables not from zero but from each one's sample mean. If primed small letters represent the transformed variables, we know that the a's drop out and the 6's remain unchanged. This is so because the model is linear. Our relations (10-1) and (10-2) now become x[ = 612X2 • • '• x' z = bn.al 4- 632.1*2 144 SEARCHING FOR HYPOTHESES AND TESTING THEM Exercises 10.A Prove r<x,)<x,) = *Vx*t'> m r i2, that is to say, that correla- tion does not depend on the origin of measurement. 10. B Prove rlj = bifin Hint: Use moments. This relation says that the coefficient of correlation between two variables equals the geometric mean of the two regression slopes we get if we treat each in turn as the independent variable. The less these two regressions differ, the nearer is the correlation to +1 or — 1. 10.6. Partial correlation Two factors may account for x[, the sale of a lot of liquor: (1) there are many people (x\) ; (2) it is very cold (zj). This relation is expressed X[ m blMXl + fris^a (10-3) But the reason that (1) there are many people in the resort is (a) that the weather is cold, and (possibly) (&) that a lot of drinking is going on there, making it fun to bo there apart from the pleasure of skiing. This is expressed x'z = b 2 i.zx[ + 6 2 3- izj (10-4) Suppose we wanted. to know whether liquor sales would be correlated with crowds in the absence of weather variations. The measure we seek is the partial correlation between x[ and x' 2 , allowing for x' z . This measure is symbolized by 7*12.3. It is interpreted as follows: Define the variables y[ = x[ - 613.2^ (10-5) vi - 4 - &,m*J " (10-0) The j/8 are sales corrected for weather only and tourists corrected for weather only. 
If we have corrected both for weather, any remaining covariation between them is due to (1) the normal desire for people to drink liquor (the more tourists the more liquor is sold), (2) the possibility that some tourists come to enjoy drinking rather than skiing (the more liquor, the more tourists), and (3) a combination of the first two items.

The partial coefficient of correlation is defined by

$$r_{12.3} = r_{(y_1')(y_2')} \tag{10-7}$$

Exercises

10.C Prove $r_{21.3} = r_{12.3}$.

10.D Prove $r_{12.3}^2 = b_{12.3}b_{21.3}$. This is analogous to Exercise 10.B. Hint: Substitute (10-6) and (10-7) into (10-3) and (10-4).

10.E Prove

$$r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{(1 - r_{13}^2)^{1/2}(1 - r_{23}^2)^{1/2}}$$

from definition (10-7) and Exercises 10.C and 10.D.

10.F Give a common-sense interpretation of the propositions in the above three exercises.

10.G All this generalizes to four and more variables, but notation gets very messy. Exercise 10.D generalizes into the proposition: Every (partial or otherwise) coefficient of correlation equals the geometric mean of the two relevant regression coefficients. So, for example, $r_{12.34}^2 = b_{12.34}b_{21.34}$. Let $r$ stand for the matrix of all simple coefficients of correlation $r_{ij}$, and let $R_{ij}$ stand for the minor of $r_{ij}$. Then Exercise 10.E is rewritten

$$r_{12.3}^2 = \frac{R_{12}^2}{R_{11}R_{22}}$$

and with four variables

$$r_{12.34}^2 = r_{21.34}^2 = r_{12.43}^2 = r_{21.43}^2 = \frac{R_{12}^2}{R_{11}R_{22}}$$

and so on for any number of variables, the dimension of $R$ growing all the while, of course.

10.H Show that $r_{12.3}^2 = R_{12}^2/R_{11}R_{22}$ holds but collapses into an identity when there is no third variable.

10.7. Standardized variables

Let us now measure $X_1$, $X_2$, $X_3$ not only as departures $x_1'$, $x_2'$, $x_3'$ from their sample means but also in units equal to the sample standard deviation of each. So transformed, the variables are called just $x_1$, $x_2$, $x_3$. This step is useful in bunch map analysis (see Sec. 10.8).
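The partial correlation of Sec. 10.6 and the closed form of Exercise 10.E can be compared on a small sample. The sketch below is only illustrative. The eight daily observations are invented, and the correction for weather is taken to be the simple least-squares regression of each variable on temperature, which is the usual textbook construction and an assumption about how the correction is carried out here.

```python
import math


def moment(xs, ys):
    """Sum of cross products about the sample means (the m_xy of the digression)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))


def corr(xs, ys):
    """Sample coefficient of correlation."""
    return moment(xs, ys) / math.sqrt(moment(xs, xs) * moment(ys, ys))


def corrected(ys, xs):
    """ys about its mean, minus the least-squares effect of xs (simple regression)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = moment(ys, xs) / moment(xs, xs)
    return [(y - y_bar) - slope * (x - x_bar) for x, y in zip(xs, ys)]


# Hypothetical daily observations at the ski resort.
liquor = [300, 230, 150, 220, 140, 100, 360, 130]   # X1, pints sold
tourists = [120, 95, 70, 85, 60, 55, 140, 50]       # X2
temperature = [-5, -2, 0, 1, 3, 6, -8, 4]           # X3

r12 = corr(liquor, tourists)
r13 = corr(liquor, temperature)
r23 = corr(tourists, temperature)

# Route 1: correlate sales and tourists after each is corrected for weather.
r12_3_residual = corr(corrected(liquor, temperature),
                      corrected(tourists, temperature))

# Route 2: the closed form of Exercise 10.E.
r12_3_formula = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

print(r12, r12_3_residual, r12_3_formula)
```

The two routes agree to rounding error, and comparing either with `r12` shows how much of the simple correlation between sales and crowds is due to the weather acting on both.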
When this is done, nothing happens to either the population or the sample correlation coefficients, but the regression parameters between the variables do change.

Exercises

10.I Prove $m_{x_1x_2} = (m_{x_1'x_1'}m_{x_2'x_2'})^{-1/2}\,m_{x_1'x_2'}$.

10.J Prove that $r_{(x_1')(x_2')} = r_{(x_1)(x_2)} = r_{12}$ by using Exercises 10.B and 10.C.

10.K Denote the regression coefficients among $x_1'$, $x_2'$, $x_3'$ by the letter $b$ and the corresponding coefficients among the standardized variables $x_1$, $x_2$, $x_3$ by the letter $a$, with appropriate subscripts. Interpret $a_{12.3}$, $a_{21.3}$, $a_{31.2}$; show that they differ in meaning from $a_{1.23}$, $a_{2.13}$, $a_{3.12}$, respectively.

10.L Show that $a_{12} = b_{12}(m_{22}/m_{11})^{1/2}$ and, in general, that $a_{ij.k} = b_{ij.k}(m_{jj}/m_{ii})^{1/2}$.

10.M Show, by using Exercise 10.L, that $r_{ij.k}^2 = a_{ij.k}a_{ji.k}$.

10.N Show that $r_{12} = a_{12}$. This is a very important property, which says that regression and correlation coefficients are identical for standardized variables.

10.O Let $x_i'' = (X_i - \varepsilon X_i)(\sigma_{ii})^{-1/2}$. Prove $\rho_{(x_i'')(x_j'')} = \rho_{(X_i)(X_j)}$, and interpret.

10.8. Bunch map analysis

Bunch maps are mainly of archaeological or antiquarian interest. They seem to have gone out of fashion. Beach (pp. 172-175) gives an excellent account of them which I shall not repeat here. I shall merely discuss necessary and sufficient conditions under which bunch maps help to accept or reject hypotheses.

Turn to the example of liquor sales, skiers, and cold weather in Sec. 10.5. Let $x_1$, $x_2$, $x_3$ be the three standardized variables. Let their correlation coefficients be

$$r = \begin{bmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0.5 & 0.2 \\ 0.5 & 1 & 0.8 \\ 0.2 & 0.8 & 1 \end{bmatrix}$$

Compute the least squares regressions of all normalized variables, two at a time:

$$\begin{aligned} x_1 &= a_{12}x_2 & x_1 &= a_{13}x_3 \\ x_2 &= a_{21}x_1 & x_2 &= a_{23}x_3 \\ x_3 &= a_{31}x_1 & x_3 &= a_{32}x_2 \end{aligned} \tag{10-8}$$

and then three at a time:

$$\begin{aligned} x_1 &= a_{12.3}x_2 + a_{13.2}x_3 \\ x_2 &= a_{21.3}x_1 + a_{23.1}x_3 \\ x_3 &= a_{31.2}x_1 + a_{32.1}x_2 \end{aligned} \tag{10-9}$$

Construct now the unit squares, shown in Fig. 19, where 0 marks the origin.
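The regressions (10-8) and (10-9) for the correlation matrix given above can be worked out numerically before any beams are drawn. The following is a minimal sketch, not the author's procedure: the helper names `rho`, `a_simple`, and `a_partial` are hypothetical, and the three-variable coefficients are obtained by solving the two-equation least-squares normal equations for standardized variables, which gives $a_{ij.k} = (r_{ij} - r_{ik}r_{jk})/(1 - r_{jk}^2)$.

```python
# Off-diagonal entries of the correlation matrix given in Sec. 10.8.
r = {(1, 2): 0.5, (1, 3): 0.2, (2, 3): 0.8}


def rho(i, j):
    """Look up r_ij, using symmetry and r_ii = 1."""
    if i == j:
        return 1.0
    return r[(min(i, j), max(i, j))]


def a_simple(i, j):
    """Two-variable coefficient a_ij; for standardized variables it equals r_ij."""
    return rho(i, j)


def a_partial(i, j, k):
    """Three-variable coefficient a_ij.k from the least-squares normal equations."""
    return (rho(i, j) - rho(i, k) * rho(j, k)) / (1 - rho(j, k) ** 2)


# Two-variable beams for the pair (x1, x2), as (horizontal, vertical) points.
beam_12 = (1.0, a_simple(1, 2))   # unit change in x2 and the response of x1
beam_21 = (a_simple(2, 1), 1.0)   # response of x2 to a unit change in x1

# Three rival answers for (change in x1) / (change in x2), one per equation of (10-9).
slope_from_eq1 = a_partial(1, 2, 3)
slope_from_eq2 = 1 / a_partial(2, 1, 3)
slope_from_eq3 = -a_partial(3, 2, 1) / a_partial(3, 1, 2)

print(beam_12, beam_21)
print(slope_from_eq1, slope_from_eq2, slope_from_eq3)
```

With these correlations the three equations give visibly different slopes, which is the kind of conflict among regressions that a bunch map is meant to display.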
In each block the horizontal axis corresponds to the independent variable, and the vertical to the dependent. The labels below the squares show which is which.

Refer now to the first equation $x_1 = a_{12}x_2$ in (10-8). From Exercise 10.N, $x_1 = r_{12}x_2$. Imagine a unit variation in the independent variable $x_2$; then the corresponding variation in $x_1$, according to this equation, is $a_{12}$. Plot the point $(1, a_{12})$ in the first block of squares. Then go to the symmetrical equation $x_2 = a_{21}x_1$, make $x_1$ vary by $\Delta x_1 = 1$, and plot the resulting point $(a_{21}, 1)$ in the same block. In a similar way fill out the top row of Fig. 19, drawing the beams from the origin.

In (10-9), first consider the variation in $x_1$ resulting from variations in $x_2$, other things being equal. We get three different answers from (10-9), one per equation:

$$\Delta x_1 = a_{12.3}\,\Delta x_2 \qquad \Delta x_1 = \frac{1}{a_{21.3}}\,\Delta x_2 \qquad \Delta x_1 = -\frac{a_{32.1}}{a_{31.2}}\,\Delta x_2$$

Digressing a little, I state without proof that

$$a_{ij.k} = \frac{R_{ij}}{R_{ii}} \tag{10-10}$$

[Fig. 19. Bunch maps. In the beam labels, first place: dependent variable; second place: independent variable; after the dot: variable allowed for.]

Therefore we get from (10-10) the three statements that $\Delta x_1 : \Delta x_2$ is proportional to $R_{12} : R_{11}$, to $R_{22} : R_{21}$, and to $-R_{32} : R_{31}$. In the figure this is depicted, respectively, by the beams marked (12.3). The three regressions in general conflict both with regard to the slope and with regard to the length of the beams.

Derive the corresponding relations for $\Delta x_1 : \Delta x_3$ and $\Delta x_2 : \Delta x_3$. These results are plotted in the last two panels of Fig. 19.

Exercise

10.P Plot the bunch maps for

$$\rho = \begin{bmatrix} 1 & -0.6 & -0.1 \\ 0.6 & 1 & 0.6 \\ 0.1 & 0.6 & 1 \end{bmatrix}$$

Scanning Fig. 19 is supposed to tell us (1) which two variables to regress if we want to stick to two of the given three, and (2) whether a third variable is superfluous, useful, or detrimental in some very loose sense.

What do we look for in Fig. 19?
Three things: (1) opening or closing of bunch maps as you go from the upper to the lower panel, (2) shortening of the beams, and (3) change of direction of the bunches. There is no simple intuitive way to interpret the many combinations of 1, 2, and 3; this is the main reason why statisticians have abandoned bunch maps. The examples that follow far from exhaust the possibilities. The moral of these examples is: To interpret the behavior of the bunch maps, you must translate them into correlation coefficients $r$ and try to interpret what it means for the coefficients to be related in one way or another. But one might as well start with the correlation coefficients, bypassing the bunch maps altogether.

Example 1. The vanishing beam

What can we infer if beam $R_{12}/R_{11}$ shrinks in length? Take the extreme case $R_{12} = 0$ and $R_{11} = 0$. These imply $r_{12} = r_{13}r_{23}$ and $r_{23}^2 = 1$, which, in turn, imply $r_{23} = \pm 1$ and $r_{12} = \pm r_{13}$. Let us restrict the illustration to the plus-sign case $r_{23} = 1$, $r_{12} = r_{13}$.

The meaning of $r_{23} = 1$ is that $x_2$ and $x_3$ in the sample, uncorrected for variations in $x_1$, are indistinguishable variables. Relation $r_{12} = r_{13}$ shows that, if $x_2$ and $x_3$ were corrected for $x_1$, the corrections would be identical; the resulting corrected variables are also identical. This can also be seen from the fact that in these circumstances $r_{23.1}$ equals 1. All this would, of course, be detectable from the top level of Fig. 19, signifying that three variables are too many and that any two are nearly as good as any other two.

Example 2. The tilting beam

What does it mean if beam $R_{12}/R_{11}$ tilts toward one axis without shrinking in length? For instance, let $R_{12} \neq 0$ and $R_{11} = 0$. This implies again that $r_{23} = \pm 1$, that is to say, $x_2 = \pm x_3$. Taking again just the + case, this signifies that the uncorrected $x_2$ and $x_3$ are in perfect agreement. However, $R_{12} = r_{12} - r_{13} \neq 0$, or $r_{12} \neq r_{13}$; take the case $r_{12} < r_{13}$ for the sake of the illustration.
The inequality $r_{12} \neq r_{13}$ suggests that the corrections of $x_2$ and $x_3$ to take account of variations in $x_1$ will be different corrections and will upset the perfect harmony. This can be seen again from

$$r_{23.1} = \frac{R_{23}}{(R_{22}R_{33})^{1/2}} = \frac{r_{23} - r_{12}r_{13}}{(1 - r_{13}^2)^{1/2}(1 - r_{12}^2)^{1/2}} = \frac{1 - r_{12}r_{13}}{(1 - r_{13}^2)^{1/2}(1 - r_{12}^2)^{1/2}} \neq 1$$

In terms of our example, there is a spurious perfect correlation between $x_2$, the number of skiers, and $x_3$, the weather. It is spurious because some skiers come to enjoy not the weather but the liquor. However, liquor sales respond less perfectly to tourist numbers than to weather; that is, $r_{12} < r_{13}$. Therefore, if you take into account the fact that liquor too attracts skiers, the weather is not so perfectly predictable a magnet for skiers as you might have thought by looking at $r_{23} = 1$. The hypothesis accepted in this case is: Liquor is significant and ought to be introduced in a predictive model.

Exercises

10.Q Show that, if beam $a_{12.3}$ has the same slope as $a_{12}$, this implies $a_{12.3} = r_{12}$ and also $r_{13} = r_{23}$ and, hence, that all three beams of the bunch map come together. Interpret this.

10.R Interpret the situation where all three beams $R_{12}/R_{11}$, $R_{22}/R_{21}$, and $-R_{32}/R_{31}$ have the same slope. Must they necessarily have the same length? Must the common slope necessarily equal $a_{12}$?

10.9. Testing for linearity

If the rival hypotheses are

$$y = \beta x + u \quad \text{versus} \quad y = \gamma x^2 + v$$

the matter is quickly settled by comparing the correlation coefficients $r_{xy}$ with $r_{(x^2)(y)}$. Things become complicated if the quadratic function contains a linear term, because the function $y = \gamma x^2 + \delta x + v$ contains the linear function $y = \beta x + u$ as a special case; therefore, we would expect the correlation to be improved by adding a higher term. Thus, for any fit giving estimates $\hat\beta$, $\hat\gamma$, and $\hat\delta$, $r_{(y)(\hat\gamma x^2 + \hat\delta x)}$ is bound to be greater than $r_{(y)(\hat\beta x)}$.

Correlation coefficients do not give the best tests of linearity. Common sense suggests something simpler and more intuitive.
The curves in Fig. 20a represent the two rival hypotheses. If the quadratic is true but we fit a straight line, then the computed residuals from the fitted straight line will be overwhelmingly positive for some ranges of $x$ and overwhelmingly negative for other ranges. These ranges are defined in terms of the intersections of the rival curves. Somewhere left of $A$ most residuals are negative, and to the right, most are positive. Complicated numerical formulas for testing nonlinearity are nothing but algebraic translations of this simple test.

[Fig. 20. Tests of nonlinearity. Panels (a), (b), and (c).]

All this generalizes quite readily. For instance, the test of hypothesis $y = \alpha x + u$ versus a cubic is sketched in Fig. 20b; a quadratic versus a cubic in Fig. 20c. And it generalizes into several variables $x$, $y$, $z$, etc.

In each case the test consists in dividing the range of $x$ into several equal parts $P_1$, $P_2$, . . . , as shown in either Fig. 21a or 21b. In each part compute the average straight-line regression residual av $u$. If this tends to vary systematically (with a trend or in waves), the relationship is nonlinear.

[Fig. 21. The interval test. Panels (a) and (b), with the $x$ axis divided into parts $P_1$, $P_2$, $P_3$, . . .]

When we have three or more variables $x$, $y$, $z$ and want to test linearity versus some other hypothesis, we have to extend to two dimensions the technique of Fig. 21. Let the rival hypotheses be

$$x = \alpha + \beta y + \gamma z + u \quad \text{versus} \quad x = \delta + \epsilon y + \zeta y^2 + \eta z + \theta z^2 + \iota yz + v$$

In the $yz$ plane the intersection of these two surfaces projects a hard-to-solve-for and messy curve $KLMNP$ (see Fig. 22a). Instead of obtaining it, let us see whether we can sketch it vaguely. Divide the sample range of $y$ and $z$ into chunks, as shown in the figure (they do not need to be square, and they may overlap in a systematic way analogous to Fig. 21b). In each chunk, compute the average linear residual av $u$, and see whether a pattern emerges.
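The one-variable interval test described above can be sketched in a few lines. This is an illustration under stated assumptions: the function name `interval_test` is hypothetical, the fitted line includes an intercept, the parts are non-overlapping equal-width intervals of the sample range of $x$, and both data sets are invented.

```python
def interval_test(xs, ys, parts):
    """Fit y = a + b*x by least squares and return the average residual
    in each of `parts` equal-width intervals of the range of x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / parts
    buckets = [[] for _ in range(parts)]
    for x, y in zip(xs, ys):
        k = min(int((x - lo) / width), parts - 1)   # the largest x goes in the last part
        buckets[k].append(y - (a + b * x))
    return [sum(bk) / len(bk) if bk else None for bk in buckets]


# Hypothetical data: one truly quadratic relation, one truly linear relation.
xs = list(range(12))
ys_quadratic = [x * x for x in xs]
ys_linear = [3 + 2 * x for x in xs]

av_u_quadratic = interval_test(xs, ys_quadratic, parts=4)
av_u_linear = interval_test(xs, ys_linear, parts=4)

print(av_u_quadratic)
print(av_u_linear)
```

For the quadratic data the average residual is positive in the two outer parts and negative in the two inner ones, a wave of the kind the test looks for; for the linear data every average is zero up to rounding.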
By drawing approximate contour lines according to the local elevation of av $u$, we may be able to detect mountains or valleys, which tell us that the true relationship is nonlinear. Something analogous can be done when both rival hypotheses are nonlinear.

10.10. Linear versus ratio models

The rival hypotheses here are

$$\frac{c}{n} = \alpha + \beta\frac{y}{n} + u \quad \text{versus} \quad c = \gamma + \delta y + \epsilon n + v$$

where $u$ and $v$ have the usual properties to ensure that least squares fits are valid.

If the ratio model is the maintained hypothesis, then we would expect av $u$ to be constant over successive segments of the axis $y/n$. Translated into the projection on the $yn$ plane, this means that av $u$ should be constant in the successive slices shown in Fig. 22b. For the linear model, av $v$ should be constant in the squares of Fig. 22c. In general, one criterion will be satisfied better than the other and will plead for the rejection of the opposite hypothesis. If both criteria are substantially satisfied, then there is no problem of choosing, because both formulations say that $c$, $y$, and $n$ are related linearly and homogeneously ($\gamma = 0$). One formulation might possibly be more efficient than the other for reasons of "skedasticity" (compare Sec. 2.15).

[Fig. 22. Panels (a), (b), (c), and (d); panel (d) distinguishes Poor and Rich.]

10.11. Split sectors versus sector variable

The rival hypotheses here are whether the demand for, say, sugar should be estimated for all consumers as a linear function of price and income $q = \gamma p + \delta y + w$ (where the price paid is uncorrelated with income) or should be split into several demand functions $q = \alpha p + u$, $q = \beta p + v$, etc., one for each income class, on the ground that price means more to the poor than to the rich.

For illustration it is enough if we have just two income classes, the rich and the poor, corresponding to, say, $y = 10$, $y = 1$.
Nothing essential would be added if $y$ were taken as a continuous variable. As in Fig. 22c, construct a grid for the sample range of variables $y$ and $p$. If av $w$ is constant, the single equation $q = \gamma p + \delta y + w$ is good enough, and, moreover, we have $\alpha = \beta$ in the alternative hypothesis. If, however, the second hypothesis is correct, not only will $\alpha$ be very different from $\beta$, but av $w$ will display contours like those of Fig. 22d.

10.12. How hypotheses are chosen

In this section I am neither critical, nor constructive, nor original. I think it proper to look at the way that statistical hypothesis making and testing takes place around us.

The econometrician, geneticist, or other investigator usually begins with (1) prejudices instilled from previous study, (2) vague impressions, (3) data, (4) some vague hypotheses.

He then casts a preliminary look at the data and informally rejects some because they represent special cases (war years, for instance, or extremely wealthy people) and others because they do not square with the vague hypotheses he holds. He uses the remaining data informally to throw out some of his hypotheses, from among those that are relatively vague and not too firmly grounded in prejudice.

At this stage he may prefer to scan the data mechanically, say, by bunch maps, rather than impressionistically. Mechanical prescreening is used (1) because the variables are many, and the unaided eye is bewildered by them, and (2) because the research worker is chicken-hearted and distrusts his judgment. Logically, of course, any mechanical method is an implicit blend of theory and estimating criteria; but, psychologically, it has the appearance of objectivity. The good researcher knows this, but he too is overwhelmed by the illusion that mechanisms are objective.

Having done all this, the investigator at long last comes to specification (as described in Chap. 1); he then estimates, accepts, rejects, or samples again.
This stage-by-stage procedure is logically wrong, but economically efficient, psychologically appealing, and practically harmless in the hands of a skilled researcher with a feel for his area of study.

Instead of proceeding stage by stage, is there a way to let the facts speak for themselves in one grand test? The answer is no. We must start with some hypothesis or we do not even have facts. True, hypotheses may be more or less restrictive. But the less restrictive the hypotheses are, the less a given body of data can tell us.

Further readings

For rigorous treatment of the theory of hypothesis testing, one needs to know set theory and topology. Klein's discussion, pp. 56-62, gives a good first glimpse of this approach and a good bibliography, p. 63.

For treatment of errors in the variables, consult Trygve Haavelmo, "Some Remarks on Frisch's Confluence Analysis and Its Use in Econometrics," chap. V in Koopmans, pp. 258-265.

Beach discusses bunch maps and the question of superfluous, useful, or detrimental variables, pp. 174-175. Tinbergen, pp. 80-83, shows a five-variable example.

Cyril H. Goulden, Methods of Statistical Analysis, 2d ed., chap. 7 (New York: John Wiley & Sons, Inc., 1952), gives an elementary discussion of $\rho$ and the sample properties of its estimate $r$.

CHAPTER 11

Unspecified factors

11.1. Reasons for unspecified factor analysis

Having specified his explanatory variables, the model builder frequently knows (or suspects) that there are other variables at work that are hard to incorporate.

1. The additional variable (or variables) may be unknown, like the planet Neptune, which used to upset other orbits.

2. The additional variable may be known but hard to measure. For instance, technological change affects the production function, but how are we to introduce it explicitly? There are two ways out of this difficulty: splitting the sample, and dummy variables.
When we split the sample we fit the production function to each fragment independently in the hope that each fragment is uniform enough with regard to the state of technology and yet large enough to contain sufficient degrees of freedom to estimate the parameters. The technique of dummy variables does not split the sample, but instead introduces a variable that takes on two and only two values or levels: 0 when, say, there is peace, and 1 when there is war. Phenomena that are capable of taking on three or more distinct states are not suited to the dummy variable technique. For instance, it would not do to count 0 for peace, 0.67 for cold war, and 1 for shooting war, because this would impose an artificial metric scale on the state of world politics which would affect the parameters attached to honest-to-goodness, truly measurable variables. No artificial metric scale is introduced by the two-level dummy variable.

3. The additional factors at work may be a composite of many factors, too many to include separately and yet not numerous enough or independent enough of one another to relegate to the random term of the equation.

4. The additional variable may be known and measurable, but we may not know whether to include it linearly, quadratically, or otherwise.

5. The additional variable may be known, measurable, etc., but not simple to put in. To admit a wavy trend line, for instance, eats up several degrees of freedom.

In such cases the unspecified variable technique comes to our rescue, at a price, because it sometimes requires special knowledge. In the illustration of Sec. 11.2, for instance, to estimate a production function that shifts with technological change, time series are not enough. The data must contain information about inputs and outputs broken down, say, by region, or in some dimension besides chronology.

11.2. A single unspecified variable

This section is based on the technique developed by C. E. V.
Leser¹ in his study of British coal mining during 1943-1953, years of rapid technological change, nationalization, and other disturbances. He fitted the function

$$P_{rt} = g_t L_{rt}^{\alpha} C_{rt}^{\beta}$$

where $P$ is product, $L$ is labor, $C$ is capital, $g_t$ is the unspecified impact of technology, $r$ and $t$ are regional and time indices, and $\alpha$, $\beta$ are the unknown parameters. Here for exposition's sake, I shall linearize his model and drop the second specified variable. Consider then

$$P_{rt} = g_t + \alpha L_{rt} + u_{rt} \tag{11-1}$$

¹ C. E. V. Leser, "Production Functions and British Coal Mining" (Econometrica, vol. 23, no. 4, pp. 442-446, October, 1955).

The following assumptions are made:

1. Technology affects all regions equally in any moment of time.

2. The same production function applies to all regions.

3. The random term is normal, with a period mean

$$\frac{1}{R}\sum_{r=1}^{R} u_{rt}$$

equal to zero, and a regional mean

$$\frac{1}{T}\sum_{t=1}^{T} u_{rt}$$

also equal to zero.

We shall now use the notation av$[r]u_{rt}$ and av$[t]u_{rt}$ for expressions like the last two.¹ Now, keeping time fixed at $t = 1$, let us average inputs and outputs over the $R$ regions. From (11-1) we get, remembering that av$[r]g_1 = g_1$,

$$\text{av}[r]P_{r1} = g_1 + \alpha\,\text{av}[r]L_{r1} \tag{11-2}$$

And, by subtracting (11-2) from (11-1), we get the following relation between $P'_{r1}$ and $L'_{r1}$, which are product and labor measured from their mean values of period 1:

$$P'_{r1} = \alpha L'_{r1} + u_{r1} \tag{11-3}$$

Do the same for $t = 2, \ldots, T$ and then maximize the likelihood of the sample. Under the usual assumptions, this is equivalent to minimizing the sum of squares

$$\sum_{r,t} u_{rt}^2$$

The resulting estimate of $\alpha$ is

$$\hat\alpha = \frac{m_{P'L'}}{m_{L'L'}} \tag{11-4}$$

In this expression the moments are sums running over all regions and time periods.

¹ Read "average over the $r$ regions," "average over the $t$ years."

Having found $\hat\alpha$, we can go back to (11-2) to compute the time path $g_t$ of the unspecified variable, technology.

The method I have just outlined has several advantages:

1.
It uses $R \times T$ observations (a large number of degrees of freedom) in estimating the parameter $\alpha$.

2. Unlike split sampling, it obtains a single parameter estimate for all regions and periods.

3. It yields us an estimate of the unspecified variable (technological change), if it is the only other factor at work.

4. This technological change does not have to be a simple function of time. It may be secular, cyclical, or erratic; it can be linear, quadratic, or anything else.

5. The method estimates, in addition to technology, the effects of any number of other unspecified variables (such as inflation, war, nationalization) which at any moment may affect all regions equally.

The chief disadvantage of the technique is that the unspecified variable $g_t$ has to be introduced in a manner congenial to the model, that is to say, as a linear term in a linear model, as a factor in Leser's logarithmic model, and so forth; otherwise it would not drop out, as in (11-3), when we express the specified variables as departures from their average values.

For the unspecified variable technique to be successful it is necessary that the data come classified in one more dimension than there are unspecified variables. Thus $P$ and $L$ must have two subscripts. Moreover, each region must have coal mines in each time period.¹

¹ There are methods for treating lacunes, or missing data, but these are rather elaborate and will not be discussed in this work. The usual way to treat a lacune is by pretending it is full of data that interpolate perfectly in whatever structural relationship is finally assigned to the original data.

11.3. Several unspecified variables

Imagine now that we wish to explain retail price $P$ in terms of unit cost $C$, distance or location $D$, monopoly $M$, and the general level of inflation $J$. Cost is the specified variable, and location, monopoly, and inflation are left unspecified for one or another of the reasons I recounted in Sec. 11.1.
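Before adding further unspecified variables, the single-variable technique of Sec. 11.2 can be sketched on the linearized model (11-1). Everything numerical here is an assumption for illustration: a noise-free panel of three regions and four periods, invented labor inputs, and a made-up technology path. The code measures $P$ and $L$ from their period means as in (11-3), forms $\hat\alpha$ as in (11-4), and then recovers $g_t$ from (11-2).

```python
# Hypothetical noise-free panel: R = 3 regions, T = 4 periods.
true_alpha = 2.0
true_g = [10.0, 12.0, 11.0, 15.0]   # unspecified technology term, one value per period
L = [
    [1.0, 2.0, 4.0, 3.0],           # labor input, region 1, periods 1..4
    [2.0, 2.5, 3.0, 5.0],           # region 2
    [4.0, 3.0, 5.0, 7.0],           # region 3
]
R, T = len(L), len(L[0])
P = [[true_g[t] + true_alpha * L[r][t] for t in range(T)] for r in range(R)]


def period_mean(table, t):
    """av[r] of the table for period t: the average over the R regions."""
    return sum(table[r][t] for r in range(R)) / R


# Moments of the variables measured from their period means, summed over r and t.
m_PL = sum((P[r][t] - period_mean(P, t)) * (L[r][t] - period_mean(L, t))
           for r in range(R) for t in range(T))
m_LL = sum((L[r][t] - period_mean(L, t)) ** 2
           for r in range(R) for t in range(T))

alpha_hat = m_PL / m_LL             # the estimate of (11-4)

# Back to (11-2): the time path of the unspecified variable.
g_hat = [period_mean(P, t) - alpha_hat * period_mean(L, t) for t in range(T)]

print(alpha_hat)
print(g_hat)
```

Because the invented data contain no random term, the estimate reproduces the assumed $\alpha$ and the assumed path of $g_t$ up to rounding, even though that path is not a simple function of time.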
The model, assumed to be linear, is

$$P_{firt} = M_i + D_r + J_t + \alpha C_{firt} + u_{firt} \tag{11-5}$$

where the subscripts $f$, $i$, $r$, $t$ express firm, industry, region, and time. The model, as written, maintains that the degree of monopoly is a property of the industry only, not of the region or of inflationary situation or of interactions among the three. Similarly, inflation is solely a function of the time and not of the degree of monopoly and location of industry.

Note again that the data have to come classified in one more dimension than there are unspecified variables. Thus $P$ and $C$ must have four subscripts, one for each of the unspecified variables, plus an extra one (firm $f$). Moreover, unless we have lacunes, each firm must be present in each industry, region, and time period. The firms of Montgomery Ward and Sears Roebuck would do,¹ and the industries they enter can be, say, watch retailing, tire retailing, clothing retailing, etc. In that case, $\alpha$ is estimated analogously to (11-4) by

$$\hat\alpha = \frac{m_{P'C'}}{m_{C'C'}}$$

where the moments are sums running over $f$, $i$, $r$, $t$. Having estimated $\alpha$, we can now define a new variable $S$, the price-cost spread $S = P - \hat\alpha C$. The model is now

$$S_{firt} = M_i + D_r + J_t + v_{firt} \tag{11-6}$$

Estimating $M$, $D$, and $J$ is the so-called problem of linear factor analysis.

¹ Provided both exist in all time periods, regions, and industries included in the sample.

11.4. Linear orthogonal factor analysis

Linear factor analysis attempts to explain the spread $S$ as an additive resultant of two or more separate factors; in the example of (11-6) there are three factors: monopoly, region, and inflation. Nothing essential is lost if we confine ourselves to two factors, say, monopoly and inflation, and consider the simpler model

$$S_{fit} = M_i + J_t + v_{fit} \tag{11-7}$$

To grasp its essence, imagine that there are no random disturbances ($v = 0$) and that there is only one firm, which sells three products (tires, watches, clothes) over 5 years.
Observations can be put in a 3-by-5 table or matrix whose rows correspond to the commodities and columns to the years:

$$S = \begin{bmatrix} S_{11} & S_{12} & S_{13} & S_{14} & S_{15} \\ S_{21} & S_{22} & S_{23} & S_{24} & S_{25} \\ S_{31} & S_{32} & S_{33} & S_{34} & S_{35} \end{bmatrix}$$

Factor analysis seeks to express this table as the sum of two tables $M$ and $J$ of similar dimensions, the first with constant rows and the second with constant columns.

$$M = \begin{bmatrix} M_1 & M_1 & M_1 & M_1 & M_1 \\ M_2 & M_2 & M_2 & M_2 & M_2 \\ M_3 & M_3 & M_3 & M_3 & M_3 \end{bmatrix} \qquad J = \begin{bmatrix} J_1 & J_2 & J_3 & J_4 & J_5 \\ J_1 & J_2 & J_3 & J_4 & J_5 \\ J_1 & J_2 & J_3 & J_4 & J_5 \end{bmatrix}$$

In a practical problem this cannot be done exactly, particularly if several firms are involved. This is the familiar problem of conflicting observations, which is treated in Sec. 7.2. In practice, some compromise is found which gives the $M$ and $J$ that "fit best" the observations $S$.

A graphic way to express the problem of factor analysis is the following. You are given a rectangular piece, say, 3 by 5 miles, of a topographical map with contour lines showing the elevation at various spots. You are supposed to find a landscape profile running from north to south and another one running from east to west with the property that, if you slide the bottom of the first perpendicularly along the humps and bumps of the second, the top crests describe the original surface of the 3-by-5 map. The same happens if you interchange the roles of the two profiles. The two profiles are kept always perpendicular to each other; and this is why the literature calls the two factors $M$ and $J$ orthogonal (that is to say, right-angled). (See Fig. 23.)

Computing differences among the various entries in $M$ and $J$ is a simple matter under the usual assumptions. Again, we minimize the expression

$$\sum_{f,i,t} v_{fit}^2$$

with respect to $M_1$, $M_2$, $M_3$, $J_1$, $J_2$, $J_3$, $J_4$, $J_5$. Thus the solution for $M_1$ is

$$\hat M_1 = \frac{\sum_{f,t} S_{f1t} - F\sum_{t} \hat J_t}{FT} \tag{11-8}$$

and that of $J_3$ is

$$\hat J_3 = \frac{\sum_{f,i} S_{fi3} - F\sum_{i} \hat M_i}{FI} \tag{11-9}$$

where $F$, $I$, $T$ are the total number of firms, industries, and time periods, respectively.
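The least-squares solutions (11-8) and (11-9) can be sketched for the one-firm, 3-by-5 case. The table below is invented, and one choice in the code is an assumption rather than something the text prescribes: since adding a constant to every $M_i$ and subtracting it from every $J_t$ leaves the fit unchanged, the sketch pins the level by letting the $\hat M_i$ average zero, so that only differences among the entries carry meaning.

```python
# Hypothetical spreads for a single firm (F = 1): 3 industries by 5 years,
# built with no random disturbance from assumed factor values.
true_M = [3.0, 1.0, 2.0]
true_J = [10.0, 11.0, 13.0, 12.0, 16.0]
S = [[m + j for j in true_J] for m in true_M]

I, T = len(S), len(S[0])
grand = sum(sum(row) for row in S) / (I * T)
row_mean = [sum(row) / T for row in S]
col_mean = [sum(S[i][t] for i in range(I)) / I for t in range(T)]

# Normalization (an assumption): the M_hat average zero, J_hat carries the level.
M_hat = [rm - grand for rm in row_mean]
J_hat = list(col_mean)

# Check that the estimates satisfy (11-8) for industry 1 and (11-9) for year 3.
F = 1
check_M1 = (sum(S[0]) - F * sum(J_hat)) / (F * T)
check_J3 = (sum(S[i][2] for i in range(I)) - F * sum(M_hat)) / (F * I)

# Residuals of the additive fit.
residuals = [[S[i][t] - M_hat[i] - J_hat[t] for t in range(T)] for i in range(I)]

print(M_hat)
print(J_hat)
```

The differences among the $\hat M_i$ reproduce the differences among the assumed monopoly impacts, and with no disturbance in the data every residual is zero up to rounding.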
Note that, to estimate the degree of monopoly in the first industry, we need knowledge of inflation in all years; to estimate inflation in year 3, we need measures of monopoly for all industries.

[Fig. 23. Two elevation profiles: one running North to South over 3 miles, the other West to East over 5 miles.]

Equation (11-8) can be rationalized as follows: to estimate the effect of monopoly in the first industry, disregard the price-cost spread in all other industries, and compute the over-all (firm-to-firm and period-to-period) average spread in industry 1:

$$\frac{\sum_{f,t} S_{f1t}}{FT}$$

From this deduct the average inflationary impact

$$\frac{\sum_{t} \hat J_t}{T}$$

What is left is the monopoly impact.

11.5. Testing orthogonality

It is entirely possible for inflation's impact on the price-cost spread to be related to monopoly. Indeed, there is evidence from the Second World War that price control was more successful in monopolistic industries (and firms) than in competitive ones. A monopolist or monopolistic competitor is recognized and remembered by the public. If he takes advantage of inflation, he may lose goodwill or perhaps be sued by the government as an example to others.

If monopoly and inflation interact in this way or in some other way, the linear model (11-7) is not applicable. Because it is simple, however, we may adopt it as our null hypothesis, fit it, and look for a systematic pattern discrepancy as a test of the hypothesis. The formulas for doing this are rather complicated expressions, which I shall not bother to state.

Intuitively the test is quite simple. If by rearranging whole rows and whole columns, table $S$ can be made to have its highest entry in the upper left-hand corner, its smallest entry in the lower right-hand corner, with each row and column stepping down by equal amounts, the null hypothesis holds. For example,

$$S = \begin{bmatrix} 11 & 15 & 12 \\ 13 & 17 & 14 \end{bmatrix}$$

can be rearranged thus:

$$S' = \begin{bmatrix} 17 & 14 & 13 \\ 15 & 12 & 11 \end{bmatrix}$$

Note that

$$S' = \begin{bmatrix} 15 & 12 & 11 \\ 15 & 12 & 11 \end{bmatrix} + \begin{bmatrix} 2 & 2 & 2 \\ 0 & 0 & 0 \end{bmatrix} \tag{11-10}$$

To state the same test in terms of our geographic profiles of Sec.
11.4: Cut up the original map into north-south strips, rearrange, and then glue them together. Then cut the resulting map into east-west strips and rearrange these. Should this procedure produce a map of a territory (1) sloping from its northwest corner down to its southeast corner, (2) with neither local hills nor saddle points, and (3) such that, if you stand anywhere on a given geographical parallel and take one step south, you step down by an equal amount, say, 3 feet, and (4) such that, likewise, if you start from any point on a fixed meridian, one eastward step loses the same elevation, say, 2.1 feet, then the factors are orthogonal.

In arithmetical terms, having estimated $\hat M_1, \hat M_2, \ldots, \hat J_5$, rearrange the rows and columns so that the most monopolistic industry occupies the top row and the most inflationary year occupies the leftmost column. Compute the residuals

$$\hat v_{fit} = S_{fit} - \hat M_i - \hat J_t$$

and place their sums over firms, $\sum_f \hat v_{fit}$, in the appropriate row and column. Any run, or large local concentration, of mostly positive or mostly negative residuals is evidence that monopoly and inflation have interaction effects (are not orthogonal factors).

11.6. Factor analysis and variance analysis

Unspecified factor analysis, the technique explained in this chapter, should be carefully distinguished from variance analysis (and from factor analysis in the principal components sense of the term). Both techniques make use of a row-column classification, and both usually proceed on the null hypothesis that rows and columns do not interact. But here the similarities end. Factor analysis measures the row and column effects for each row and column, i.e., it computes the unspecified variable. Variance analysis attributes various percentages of total variance¹ to differences among all rows, to differences among all columns, and the remainder to chance. Factor analysis ends with I + T estimates $\hat M_1, \hat M_2, \ldots, \hat M_I; \hat J_1, \hat J_2, \ldots, \hat J_T$.
Variance analysis ends with three percentages expressing row variance, column variance, and unexplained variance in terms of total variance.

¹ Total variance in terms of the example, model (11-7), is $\sum_{f,i,t} (S_{fit} - \operatorname{av} S)^2 / FIT$, where av S is the average of $S_{fit}$ over all f, i, t, that is, the average spread over the entire sample.

In the course of analysis of variance, row means (14⅔ and 12⅔) and column means (16, 13, and 12) are computed, but they are only auxiliary quantities, not estimates of factor impacts. However, the differences in these two sets of means are equal respectively to the differences in the impact [(2 and 0) and (15, 12, and 11)] of the two variables into which S' is factorable [see equation (11-10)].

It is not my intention to go into the details of variance analysis. Just three comments about it:

1. The reason why people analyze variance and not the fourth or seventeenth moment of the sample is this: A normal distribution with zero mean (such as the error term $v_{fit}$) can be completely described by its variance. The variance is a sufficient estimate, for it contains all the information that is implicit in the assumed distribution.
2. Under orthogonality, row, column, and unexplained variances add up to total variance, just as the square on the hypotenuse equals the sum of the squares on the other sides of a right-angled (orthogonal) triangle.
3. Under normality and orthogonality, variance ratios have certain convenient distributions, which are suitable for testing the null hypothesis (that rows or columns differ only by chance).

Further readings

Harold W. Watts, "Long-run Income Expectations and Consumer Saving," in Studies in Household Economic Behavior, by Dernburg, Rosett, and Watts (Yale Studies in Economics, vol. 9, pp. 103-144, New Haven, Conn., 1958), makes judicious use of dummy variables.

Robert M. Solow, "Technical Change and the Aggregate Production Function" (Review of Economics and Statistics, vol. 39, no. 3, pp.
312-320, August, 1957), computes the unspecified variable "technology" not, as we have done in Sec. 11.2, by interregional aggregation, but by using the marginal productivity theory of distribution.

Variance analysis is a vast subject. See Kendall, vol. 2, chaps. 23 and 24.

CHAPTER 12

Time series

12.1. Introduction

A time series x(t) = [x(1), . . . , x(T)] is a collection of readings, belonging to different time periods, of some price, quantity, or other economic variable. We shall confine ourselves to discrete, consecutive, and equidistant time points.

Like all the kinds of manifestations with which econometrics deals, economic time series, both singly and in combination, are generated by the systematic and stochastic logic of the economy. The same techniques of estimation, hypothesis searching, hypothesis testing, and forecasting that work elsewhere in econometrics work also in time series. Why then a chapter on time series? Why indeed, were it not for the large amount of muddle and confusion we have inherited from many decades of well-intentioned but faulty investigations.

The earliest and most abused time series are charts of the business cycle and security market behavior. Desiring knowledge, business cycle "physiologists" avoided all models, assumptions, and hypotheses in the hope that the facts would speak for themselves. Pursuing profit, stock market forecasters have sought and are seeking (and their clients are buying) short cuts to strategic extrapolations; they have cared nothing about the logic, whether of the economy or of their methods. Their Economistry is the crassest of alchemies.

The key ideas of this chapter are these: Facts never speak for themselves. Every method of looking at them, every technique for analyzing them is an implicit econometric theory. To bring out the implicit assumptions for a critical look, we shall study averages, trends, indices, and other very common methods of manipulating data.
I do not mean to condemn the traditional approaches altogether. Certainly, physiology and "mere" description can do no harm, for ultimately they are the sources of hypotheses. To look for quick, cheap, and simple short cuts to forecasting is a reasonable research endeavor. Furthermore, modern machines can help by doing much of the dull work, provided that an intelligent being is available to study their output.

12.2. The time interval

Up to now I have carefully avoided any discussion of time. In the model of Chap. 1

$$C_t = \alpha + \gamma Z_t + u_t \tag{12-1}$$

what does t = 1, 2, . . . , T represent, and why not select different intervals? The secret is that the time interval t, the parameters α and γ, the variables C and Z, and the stochastic term u must be defined not in isolation but with regard to one another. If the time interval is short, then γ must be the short-run marginal propensity to consume. If t is a year, then it makes sense for Z to be treated as predetermined. As the time interval is shortened, more and more variables change from predetermined to simultaneously determined. With shorter and shorter time periods, the causes that generate the random terms overlap more and more and invalidate the assumption of serially independent random disturbances.

In certain cases we deliberately reduce the number of time intervals of our data in order to bring time into agreement with the parameters and stochastic assumptions. For example, if we are trying to estimate a production or cost function and have hourly data for inputs and outputs, we may lump these into whole working days; otherwise the disturbances during the morning warm-up period, coffee break, lunch time, and the various peak fatigue intervals are not drawn from the same Urn of Nature.

The smoothing of time series must be done with care.
In the above example, if the purpose is to make the random disturbance come from the same Urn in each interval, then overlapping as well as nonoverlapping workdays will do. If we also want the disturbances to be serially independent, then only nonoverlapping days should be used.

Digression on moving averages and sums

Moving averages differ from moving sums only by a constant factor P equal to the number of original intervals smoothed together. If P is even (= 2N), the average or sum should be centered on the boundary between intervals N and N + 1. If 2N + 1 intervals are averaged, center on the (N + 1)st.

There are many smoothing methods besides the unweighted moving average. We are free to decide on the span P of the moving average and on the weight to be given each position within the span. Given P successive points, we may wish to fit to them a least squares quadratic, logistic, or other curve. Every particular curve implies a particular set of weights, and conversely. Fitting a polynomial of degree Q through P points can be approximated by taking the simple moving average of a simple moving average of a simple moving average . . . enough times and with suitable spans. All this is straightforward and rather dull, unaccompanied by theoretical justification.

What makes moving averages interesting is the claim that they can be used to determine and remove the trend of a time series. We shall see in Sec. 12.8 how dangerous a technique this is. As we shall see in Sec. 12.5, moving averages give rise to broad oscillations where none exist in the original series.

12.3. Treatment of serial correlation

The term serial correlation, or autocorrelation, means the nonindependence of the values $u_t$ and $u_{t-\theta}$ of the random terms. The term autoregression applies to values $x_t$ and $x_{t-\theta}$ when $\operatorname{cov}(x_t, x_{t-\theta}) \neq 0$.
In this section we consider briefly (1) the sources of serial correlation, (2) its detection, (3) the allowances and modifications, if any, that it occasions in our estimating techniques, (4) the consequences of not making these allowances and modifications.

Random terms are serially correlated when the time interval t is too short, when overlapping observations are used, and when the data from which we estimate were constructed by interpolation.

Thus, if, in $C_t = \alpha + \gamma Z_t + u_t$, t measures months or weeks, then the random term has to absorb the effects of the months' being different in length, weather, and holidays, effects which are not random in the short period but which follow a cycle of 365 days. If, however, t is measured in years, then all these influences are equalized, one year with another, and $u_t$ loses some of its autocorrelation. Similarly, if successive sample points are dated "January to December," "February to January," "March to February," and so on, successive random terms are correlated at least 11/12 (11 being the number of months common to successive samples).

Frequently the raw materials of econometric estimation are constructed partly by interpolation. For instance, there is a census in 1950 and in 1960. Annual sample surveys in 1951, 1952, . . . measure births, deaths, and migrations; these data, cumulated from 1950, should square with the census population figure of 1960. Since this seldom happens, the discrepancy in the final published figures is apportioned (in general, equally) among the several years of the decade. The resulting annual figures for birth rate, etc., share equal portions of a certain error of measurement and are, therefore, correlated more than they otherwise would be. In a model that uses annual data on the birth rate and assumes that it is measured without error, it is the random term that absorbs the year-to-year correlation.

We shall illustrate with the simple model (12-1).
There are two ways to detect serial correlation. One is to maintain the null hypothesis that none exists:

$$\operatorname{cov}(u_t, u_{t-\theta}) = 0 \tag{12-2}$$

estimate the model on this assumption, and then check whether $m_{\hat u_t \hat u_{t-\theta}}$ is near zero. The other way is to maintain that the random disturbances do have a serial connection, such as

$$u_t = \zeta_1 u_{t-1} + \cdots + \zeta_\theta u_{t-\theta} + v_t \tag{12-3}$$

(with $v_t$ random and nonautocorrelated), and estimate the ζs to see whether they are significantly different from zero.

The first method is arithmetically easier, though a less powerful test. This requires explanation. The likelihood function of our sample is the same as (2-4):

$$L = (2\pi)^{-S/2} \det(\sigma_{uu})^{-1/2} \exp\left[-\tfrac{1}{2}\, u' (\sigma_{uu})^{-1} u\right]$$

where u stands for the successive random disturbances $(u_1, u_2, \ldots, u_S)$. In maximizing L with respect to α and γ, we should get the greatest efficiency in $\hat\alpha$ and $\hat\gamma$ if we took account of the fact that $\sigma_{uu}$ is no longer diagonal when there is serial correlation among the disturbances. The null hypothesis $\operatorname{cov}(u_t, u_{t-\theta}) = 0$, though it does not bias $\hat\alpha$ or $\hat\gamma$ or make them inconsistent, does nevertheless increase their sampling variances and covariances. The $\hat u$'s are computed with the help of the inefficient $\hat\alpha$ and $\hat\gamma$ and are themselves inefficient estimates of the true disturbances. Therefore $m_{\hat u_t \hat u_{t-\theta}}$ is an inefficient (i.e., overspread) estimator of $\operatorname{cov}(u_t, u_{t-\theta})$ and provides a flabby test of serial correlation. It does not reject the null hypothesis with so much confidence as a more powerful test (i.e., one associated with a very pinched distribution of $m_{\hat u_t \hat u_{t-\theta}}$).

Instead of testing by $m_{\hat u_t \hat u_{t-\theta}}$, it is recommended that we compute the expression $D(\theta)$, which happens to have convenient properties, which are of no concern to the present discussion. It is easily seen that, if $m_{\hat u_t \hat u_{t-\theta}} = 0$, then $D(\theta) = 2$. Large departures from this value indicate that the null hypothesis is untrue.
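A statistic with the property just stated (a value near 2 when the lagged product moment of the residuals vanishes) can be sketched as the ratio of the mean squared lagged difference of the residuals to their mean square. This is an assumption about the form of D(θ), in the spirit of the von Neumann and Durbin-Watson ratios; the exact expression is not reproduced here, and the name `difference_ratio` is hypothetical.

```python
def difference_ratio(residuals, lag=1):
    """Mean squared lag-difference of residuals over their mean square.

    Assumed form (not taken from the text): values near 2 suggest no
    serial correlation at this lag, values well below 2 suggest positive
    correlation, and values well above 2 suggest negative correlation.
    """
    n = len(residuals)
    mean_sq_diff = sum((residuals[t] - residuals[t - lag]) ** 2
                       for t in range(lag, n)) / (n - lag)
    mean_sq = sum(e * e for e in residuals) / n
    return mean_sq_diff / mean_sq


# Perfectly alternating residuals: strong negative serial correlation.
print(difference_ratio([1, -1] * 10))
# One long positive run followed by one long negative run.
print(difference_ratio([1] * 4 + [-1] * 4))
```

Expanding the squared difference shows why 2 is the neutral value: the numerator is roughly twice the mean square of the residuals less twice their lagged product moment.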
The second method for taking into account the serial correlation of the random disturbance is more efficient than the first, but biased. To see this, consider the special case

$$C_t = \alpha + \gamma Z_t + u_t \tag{12-4}$$

$$u_t = \zeta u_{t-1} + v_t \tag{12-5}$$

As good simultaneous-approach proponents, we combine the two equations as follows:

$$C_t - \zeta C_{t-1} = \alpha(1 - \zeta) + \gamma (Z_t - \zeta Z_{t-1}) + v_t \tag{12-6}$$

and maximize the joint likelihood of the random disturbances v with respect to the three parameters α, γ, ζ. Unfortunately, not only does this lead to a high-order system of equations, but the maximum likelihood estimates are biased. The reason for bias is the same as in Chap. 3, namely, that (12-6) is a model of decay.

There is yet a third method, which is somewhat biased and somewhat inefficient. First fit (12-4) by least squares, ignoring (12-5): this step is inefficient. Then compute $\hat\zeta$ from (12-5) using the residuals $\hat u$ of the previous step: this introduces the bias. Next construct the new variables $c_t = C_t - \hat\zeta C_{t-1}$, $z_t = Z_t - \hat\zeta Z_{t-1}$ and fit by least squares

$$c_t = \alpha(1 - \hat\zeta) + \gamma z_t + w_t$$

to get a new approximation to α and γ. Repeat the cycle any number of times.

When several equations have autocorrelated error terms, this biased second method always works in principle. The first and third methods are dangerous to use because we know practically nothing about how good $S\, m_{\hat u_t \hat u_{t-\theta}} / (S - \theta)\, m_{\hat u \hat u}$ is as an estimator of the regression coefficient of $u_t$ on $u_{t-\theta}$; nor do we know whether the cyclical procedure of the third method converges. Matters get rapidly worse the more complicated the dependence of $u_t$ on its past values.

12.4. Linear systems

Most business cycle analysis proceeds on the assumption (sometimes explicitly stated, more often not) that an economic time series x(t) is made up of two or more additive components f(t), g(t), . . . called the "trend," the "cycle," the "seasonal," and the "irregular."
Trend, cycle, and seasonal are supposed to be, in some relevant sense, rather stable functions of time; the irregular is not. We shall use the expressions "irregular," "random component," "error," and "disturbance" interchangeably. The word "additive" signifies, as usual, lack of interaction effects among the components.¹

¹ See Sec. 1.11.

In analyzing time series, the problem is to allocate the observed fluctuations in x to its unknown additive components:

$$x(t) = f(t) + g(t) + h(t) + u_t \tag{12-7}$$

and to find the shapes of f, g, and h: whether they are straight lines, polynomial or trigonometric functions, or other complicated forms.

As stated, the problem is indeterminate. The facts will never tell us either how many additive terms the expression in (12-7) should have or what shapes are best. As usual, we must maintain a hypothesis: that the trend is, say, a straight line $\alpha + \beta t$; that the cycle is some trigonometric function, e.g., $\gamma \sin(\delta + \varepsilon t)$; and so forth, the problem being to estimate the Greek letters from data or to see how well a given formulation fits in comparison with some rival hypothesis.

Trigonometric functions can be approximated by lagged expressions, such as

$$g(t) = \gamma_0 + \gamma_1 x(t-1) + \gamma_2 x(t-2) + \cdots + \gamma_Q x(t-Q) + u_t \tag{12-8}$$

with appropriate coefficients. The term "linear" expresses the additivity of the components of (12-7) or the linear approximation of (12-8) or both. In this section and in several more, we shall consider linear systems of a single variable x(t).

Linearity in the second sense (above) is very handy, because in linear systems the number of lags in (12-8) and the values of the γs determine whether g(t) oscillates, explodes, or damps; the initial value g(0) determines only the amplitude of the fluctuations. In nonlinear systems amplitude and type are not separable in this way.

We shall devote Secs. 12.5 to 12.7 to a priori trendless systems; then, in Sec.
12.8, we shall inquire how we know a system to be trendless and, if it has a trend, how this trend can be removed.

12.5. Fluctuations in trendless time series

A trendless or a detrended time series can be random, oscillating, or cyclical. It is random if it can be generated by independent drawings from a definable Urn of Nature. It is cyclical (or periodic) if it repeats itself perfectly every Ω time periods; it is oscillatory if it is neither random nor periodic.

A simple trigonometric function like $\sin(2\pi t/\Omega)$ or $\sin(2\pi t/\Omega) + b \sin(2\pi t/\Omega)$ is strictly cyclical. The combination of two or more trigonometric functions with incommensurate¹ periods $\Omega_1, \Omega_2, \ldots$ [for instance, $x(t) = \sin(2\pi t/\Omega_1) + \cos(2\pi t/\Omega_2)$] is not periodic but oscillatory. Commensurate periods $\Omega_1, \Omega_2, \ldots$ appear in (12-7) only in the trigonometric terms sin, cos, tan, etc., and not as multiplicative factors, exponents, etc.

With the exception of purely seasonal phenomena (which are periodic), economic time series are overwhelmingly of oscillating type.

Oscillations arise from three sources: (1) the summation of nonstochastic time series with incommensurate periods, (2) moving averages of random series, and (3) autoregressive systems having a stochastic component.

We can briefly dispose of the last case first. If x(t) is an autoregressive variable

$$x(t) = \alpha_1 x(t-1) + \cdots + \alpha_H x(t-H) + u_t \tag{12-9}$$

whose systematic part would damp if u were to be continually zero, then x(t) can be expressed as a weighted moving average of the random disturbances, and so the third case reduces to the second case above.²

The moving average of a random series, however, oscillates! This proposition, the Slutsky proposition, shocks the intuition at first and, therefore, deserves some discussion.

Let us take a time series so long that we do not have to worry about any shortage of material to be averaged by moving averages.
Consider now a moving average spanning P of the original periods. To facilitate the exposition, let us take P amply large. Now the original series u(t), if it is random, should itself be neither constant nor periodic. Because if it is constant, it is not random. And if it is periodic, a given value of u depends on the previous one; hence u(t) is not random. A truly random series is neither full of runs and patterns nor entirely bereft of them. Just as a true die, once in a while, produces runs of sixes or aces, so a random time series occasionally exhibits a run. For the sake of illustration, suppose the run is 3 periods long and $u_{101} = u_{102} = u_{103} = 10$. Now consider what happens to its moving average in the neighborhood of the run. Let the span be large relative to the run, say, P = 17. Then the moving average has a run (less pronounced and more tapered) 19 periods long, that is to say, from the time that the right-hand end of the span includes $u_{101}$ to the time that its left-hand end includes $u_{103}$. A moving average of a moving average of a random series oscillates even more.

¹ Two real numbers are incommensurate when their ratio is not a rational number.
² See Kendall, vol. 2, pp. 406-407.

These simple properties are vital for the statistical analysis of business cycles.

In the first place, the economic system itself operates somewhat like a moving average of random shocks: consumers, businesses, governments get buffeted around by random external and internal impulses, such as weather, a rush of orders, a rash of tax arrears; the economy takes most of these things in its stride; it does not adjust instantaneously and completely to the shocks, but rather cushions and absorbs them over considerably larger spans than their original duration. The Slutsky proposition accounts for business oscillations as the result of averaging random shocks.

In the second place, even if the economic system itself does no averaging, statisticians do.
The national income, price indexes, and other data in all the fact books are averages or cumulants of one sort or another, frequently over time. Such data would exhibit oscillations even if the economy itself did not.

Finally, analysts who use the moving average technique (on otherwise flawless data from an economy that is innocent of averaging) either for detrending or for any other purpose may themselves introduce oscillations into their charts and so generate a business cycle where none exists.

12.6. Correlograms and kindred charts

According to the Slutsky proposition, if we want to analyze a time series we shall be well advised to leave it unsmoothed and try some direct attack.

It is natural to ask first whether a given trendless time series x(t) is oscillating or periodic. In the nonstochastic case the question can be quickly settled by the unaided eye, detecting faithful repetition of a pattern, however complicated. In the stochastic cases the faithful repetition is obscured by the superimposed random effects and their echoes, if any.

Define serial correlation of order θ as the quantity

$$\rho(\theta) = \frac{\operatorname{cov}(x_t, x_{t-\theta})}{(\operatorname{var} x_t \, \operatorname{var} x_{t-\theta})^{1/2}}$$

A correlogram is a chart with θ on the horizontal axis and ρ(θ) or its estimate r(θ) on the vertical. A strictly periodic time series has a periodic correlogram with always the same silhouette and the same periodicity. If the former is damped, so is the latter. A moving average of random terms has a damped (or damped oscillating) correlogram of no fixed periodicity. A nonexplosive stochastic autoregressive system like (12-9) has as its correlogram a damped wave of constant periodicity.

Correlograms are not foolproof. They may or may not identify correctly the type of model to which a given time series belongs.
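A minimal sketch of the serial correlation r(θ) defined above, applied to a moving average of random numbers as in the Slutsky proposition of Sec. 12.5. The function names, the seed, the series length, and the span of 12 are assumptions chosen for illustration only. The last lines reproduce the arithmetic of the three-period run smoothed with a span of 17.

```python
import random


def serial_correlation(x, lag):
    """Sample serial correlation r(lag) between x[t] and x[t - lag]."""
    a, b = x[lag:], x[:len(x) - lag]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def moving_average(x, span):
    """Unweighted moving average over `span` consecutive readings."""
    return [sum(x[t:t + span]) / span for t in range(len(x) - span + 1)]


def correlogram(x, max_lag):
    """Values r(0), r(1), ..., r(max_lag) for plotting against the lag."""
    return [serial_correlation(x, lag) for lag in range(max_lag + 1)]


random.seed(0)  # arbitrary seed, only to make the run repeatable
shocks = [random.gauss(0.0, 1.0) for _ in range(5000)]
smoothed = moving_average(shocks, span=12)
raw_r = correlogram(shocks, 3)       # near zero beyond lag 0
smooth_r = correlogram(smoothed, 3)  # large and slowly declining

# The run of Sec. 12.5: three readings of 10 smoothed with span 17.
run = [0.0] * 200
run[100:103] = [10.0, 10.0, 10.0]
run_length = sum(1 for v in moving_average(run, 17) if v != 0)
print(raw_r[1], smooth_r[1], run_length)
```

The raw shocks show almost no serial correlation, while their moving average is strongly correlated with its own recent past, which is what produces the broad, wave-like swings.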
For instance, if the random term in (12-9) is relatively large, the correlogram of x(t) will compromise between the strictly periodic silhouette of the exact autoregressive system $\alpha_1 x(t-1) + \cdots + \alpha_H x(t-H)$ and the nonperiodic silhouette of the cumulated random terms $u_T + \alpha u_{T-1} + \cdots + \alpha^{T-1} u_1$. In general, it will neither damp progressively nor exhibit any fixed periodicity. This is very unfortunate, because, from a priori theory, we expect to meet such time series often in economics.

Business cycle and stock market analysts are often interested in turning points in a series and in forces bringing about these turning points rather than in the amplitude of the fluctuations. This leads naturally to periodograms.

To take an example from astronomy, imagine that the time series x(t) measures the angle of Mars and Jupiter with an observer on earth. We know this series to be analyzable into four components: the revolutions of Earth, Mars, and Jupiter round the sun plus the minor factor of the earth's daily rotation. Periodograms are supposed to show, from evidence in the time series itself, the four relevant periods $\Omega_1 = 365.26$ days, $\Omega_2 = 687$ days, $\Omega_3 = 11.86$ years, and $\Omega_4 = 24$ hours. This is a relatively easy matter if the series is nonstochastic, if we know beforehand that only four basic periods are involved, or both. The composite series fluctuates and undergoes accelerations, decelerations, and reversals occasioned by the movements of its four basic components. All this is captured by the formulas

$$A = \sum_s x(s) \cos\frac{2\pi s}{\Omega} \qquad B = \sum_s x(s) \sin\frac{2\pi s}{\Omega}$$

where Ω is an unknown period. The periodogram is a chart with Ω on the horizontal and $S^2$ on the vertical axis. The value $S^2 = A^2 + B^2$ attains maxima when Ω takes on the values $\Omega_1, \Omega_2, \Omega_3, \Omega_4$. The technique works fairly well if x(t) is indeed composed of periodic (trigonometric) terms and a random component.
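The periodogram can be sketched as follows for a nonstochastic series built from two sine waves. The periods 12 and 30, the amplitudes, the length of 360 readings, and the grid of trial periods are all illustrative assumptions, and the sums are left unnormalized, which does not affect where the peaks fall.

```python
import math


def periodogram_value(x, period):
    """S^2 = A^2 + B^2 for one trial period (unnormalized sums)."""
    a = sum(v * math.cos(2 * math.pi * s / period)
            for s, v in enumerate(x, start=1))
    b = sum(v * math.sin(2 * math.pi * s / period)
            for s, v in enumerate(x, start=1))
    return a * a + b * b


# Illustrative series: two periodic components, no random term.
T = 360
series = [2.0 * math.sin(2 * math.pi * s / 12) + math.sin(2 * math.pi * s / 30)
          for s in range(1, T + 1)]

# Evaluate S^2 on a grid of integer trial periods.
s2 = {period: periodogram_value(series, period) for period in range(2, 61)}
best = max(s2, key=s2.get)
print(best, s2[29] < s2[30] > s2[31])
```

The largest ordinate falls at the period of the stronger component, and the weaker component shows up as a local peak at its own period.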
It works very badly when x(t) is autoregressive, because the echoes of past random disturbances are of the same order of magnitude as the smaller periodic components of $x(t) = \alpha_1 x(t-1) + \cdots + \alpha_H x(t-H)$ and claim the same attention as the latter in the formula for $S^2$. Like the correlogram, the periodogram fails us where it is most needed, that is, in the analysis of an economic time series which we know to be autoregressive and stochastic though we know nothing about the number and size of its Ωs.

12.7. Seasonal variation

The easiest periodic components to measure and allow for are those tied to astronomy. We know that the cycle of rain and shine repeats itself every 365 days, and we would naturally expect this to be reflected in any time series having to do with swim suits, umbrellas, or number of eggs laid by the average hen. The same is true of cycles imposed by custom or by the state, for instance, the seven-day recurrence of Sunday idleness, the Christmas rush, the preference of employees for July holidays. In all these cases the period itself is known, although it may be complicated by moving feasts, the varying number of days in a month, and the occasional occurrence of, say, a short month containing four Sundays plus Easter or a Friday the thirteenth. The problem here is not to find the seasonal period but its profile.

It is one thing to recognize and measure the seasonal profile and another to remove it. Sometimes we want to do the former, sometimes the latter, depending on our purpose.

If the purpose is to forecast cycles and trends, it is a false axiom that a seasonally adjusted series is a better series.
The only time we are justified in taking out seasonal fluctuations is when we believe that businessmen know there is seasonality, expect it, and adjust to it in a routine way, either consciously in a microeconomic way or in their totality when many millions of their microeconomic decisions interact to form the business climate. So, for forecasting purposes, it is legitimate to wash out seasonal movements only when they are washed out of the calculations of consumers and businessmen. If a seasonal exists but people have not detected it, it should be left in. For instance, if it were true that the stock market had seasonal properties unknown to its traders, they should not be corrected for, because the participants mistake these for basic trends and react accordingly.

Conversely, if the relevant people think there is a seasonal when in fact none exists, its imagined effect should be allowed for by the forecaster of trends. Suppose, as an example, that the market believes that the U.S. dollar falls in the summer relative to the Canadian and rises in the winter. This imagined seasonal should be taken into account in analyzing the significance of monthly or quarterly import orders.

To deseasonalize every time series may increase knowledge in all cases, but it increases forecasting accuracy only when the time has come when the market has learned all the real seasonals and imagines none where none exist.

Every formula either for measuring seasonals or for removing them is an implicit economic theory, which may be appropriate for one economic time series and inappropriate for another. For instance, treating the seasonal as an additive factor implies that a given absolute deviation from some normal or trend is equally important in all months. This is false in the case of, say, housing-construction starts in Labrador; the average number of these is, let us assume, 4 in December and 50 in July. Then 5 starts in December is a more serious departure than 51 in July.
However, many analysts use additive seasonals for each and every time series.

If the genuine seasonal period is 12 months, its profile can be approximated by averaging the scores of several Januaries, then several Februaries, etc. This technique gives a biased estimate of the seasonal profile if the time series is autoregressive, unless random disturbances 12 months apart are independent. To see this, take (for simplicity only) the one-lag autoregressive model

$$x(t) = \alpha x(t-1) + \gamma \sin\frac{2\pi t}{12} + u_t$$

and let 0 represent the first January and 12 the following one. For simplicity, let us average the values x(0) and x(12) of just two Januaries. Then we have

$$x(12) = \alpha^{12} x(0) + \alpha^{11} u_1 + \alpha^{10} u_2 + \cdots + \alpha u_{11} + u_{12} + \gamma \sum_{j=1}^{12} \alpha^{12-j} \sin\frac{2\pi j}{12}$$

which involves a moving sum of random terms, and this sum oscillates, as we already know from Sec. 12.5. The oscillation due to the random term will be confounded with the amplitude of the true seasonal. This will manifest itself in two ways: either the seasonal will seem to shift or, if it does not shift, it will contain the cyclical properties of the cumulated random effects.

12.8. Removing the trend

Ultimately, economic theory and not the facts tell us whether the trend (or longest-term movement) is linear or otherwise. If we obtain the trend as what is left after cycles and seasonals have been taken out, the trend inherits all the diseases and pitfalls of the seasonals. In particular, if we use a moving average to obtain the trend, we are almost certain to get it wrong. To see this, suppose that we have a trendless cyclical and stochastic phenomenon, say

$$x_t = \sin\frac{2\pi t}{\Omega} + u_t$$

depicted in Fig. 24. If the span P of the moving average is longer than the true period Ω, then the moving average (dashes in Fig. 24) exaggerates the oscillations and imposes a long wavy trend where none existed.
Or again, if the system is autoregressive and trendless,

$$x(t) = \alpha_1 x(t-1) + \cdots + \alpha_H x(t-H) + u_t$$

the moving average of the random term contributes its oscillations to the systematic ones and, by the same process as that shown in Fig. 24, imposes a long, wavy trend. Naturally, distortions like these arise also when x(t) truly contains some systematic trend. Moving averages distort both the trend and the cycles.

[Fig. 24]

The variate difference method eliminates trends on the ground that any trend can be approximated by a polynomial of some degree N and that such a polynomial can be brought down to zero after N + 1 differencings. Therefore, let

$$x(t) = \gamma_0 + \gamma_1 t + \cdots + \gamma_N t^N + f(t) + u_t \tag{12-10}$$

where f(t) and $u_t$ are the cyclical and random factors. The method proceeds as follows:

1. Write (12-10) for the preceding period:

$$x(t-1) = \gamma_0 + \gamma_1 (t-1) + \cdots + \gamma_N (t-1)^N + f(t-1) + u_{t-1} \tag{12-11}$$

2. Subtract (12-11) from (12-10) and call y(t) the new variable x(t) − x(t − 1). We do not need to write out y(t) in full but note only that its trend is a polynomial of one degree less than the polynomial in (12-10) and that its random component is

$$v_t = u_t - u_{t-1} \tag{12-12}$$

3. Do the same for y(t), and define z(t) = y(t) − y(t − 1); this too reduces the power of the trend and generates a random component

$$w_t = v_t - v_{t-1} = u_t - 2u_{t-1} + u_{t-2} \tag{12-13}$$

4. Continue in this fashion as long as the estimated covariances $m_{xx}$, $m_{yy}/2$, $m_{zz}/6$ decrease. (The correcting denominators are discussed below.)

To see what is going on, consider the first quadrant of Fig. 25, where x(t) was taken to be a second-degree polynomial of t. Then y(t) is a sloping straight line, and z(t) is a level one. The variance of x(t) is quite high, because x assumes many widely different values as t changes. The variance of y(t) is smaller, because, though y varies, it varies more smoothly than x. And z does not vary at all.
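The steps above can be sketched on a purely polynomial series, together with the correcting denominators 2 and 6. The quadratic trend and the helper names are illustrative assumptions. The second function uses the fact that the k-th difference of serially uncorrelated noise is a fixed linear combination of k + 1 disturbances with binomial coefficients of alternating sign, so its variance is the sum of their squares times var u.

```python
from math import comb


def difference(x):
    """First differences x(t) - x(t-1)."""
    return [later - earlier for earlier, later in zip(x, x[1:])]


def noise_variance_factor(k):
    """Variance of the k-th difference of uncorrelated noise, in units of var u."""
    coeffs = [(-1) ** j * comb(k, j) for j in range(k + 1)]
    return sum(c * c for c in coeffs)


trend = [3 + 2 * t + t * t for t in range(10)]  # illustrative quadratic trend
y = difference(trend)  # a sloping straight line: 3, 5, 7, ...
z = difference(y)      # a level line: 2, 2, 2, ...
w = difference(z)      # identically zero after N + 1 = 3 differencings

print(y, z, w)
print(noise_variance_factor(1), noise_variance_factor(2))
```

Dividing the moments of successive differences by these factors puts them on a common footing, so that a decrease reflects the shrinking trend rather than the growing noise.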
The variate difference method reduces the trend to $z$, and any remaining variation in the resulting series must be due to nontrend components.

Several things are wrong with this method. First, if $x$ extends to the second quadrant of Fig. 25, say, symmetrically, its covariation with its lagged values may be very small or even zero. And, in general, a high-degree polynomial, because it twists and turns up and down, may exhibit a smaller lag covariance than a low-degree polynomial. Hence we should faithfully carry on successive differencing in spite of a drop in the series $m_{xx}$, $m_{yy}$, $m_{zz}$. But suppose we do. How are we to tell when the polynomial of unknown degree has finally died down? For meanwhile, as (12-12) and (12-13) show, we are performing moving averages of the cyclical component and, for all we know, this component may increase or decrease. Finally, the variate difference method cannot come to any stop if its cyclical component has a short lag. For instance, the first differences of 1, -1, 1, -1, . . . are 2, -2, 2, -2, . . . , and the first differences of the latter are 4, -4, 4, -4, and so on.

[Fig. 25]

Now a word about the correcting denominators. If $u_t$ itself is serially uncorrelated, then, from (12-12), the variance of $v_t$ is twice that of $u_t$, since

$\mathrm{var}\, v_t = \mathrm{cov}(u_t, u_t) - 2\,\mathrm{cov}(u_t, u_{t-1}) + \mathrm{cov}(u_{t-1}, u_{t-1}) = \mathrm{cov}(u_t, u_t) + \mathrm{cov}(u_{t-1}, u_{t-1}) = 2\,\mathrm{cov}(u_t, u_t)$

Similarly,

$\mathrm{var}\, w_t = \mathrm{var}(u_t - 2u_{t-1} + u_{t-2}) = \mathrm{var}\, u_t + 4\,\mathrm{var}\, u_{t-1} + \mathrm{var}\, u_{t-2} = 6\,\mathrm{var}\, u_t$

and so on for higher-order differences.

In my opinion, all these methods for detecting or eliminating the trend have serious imperfections. The way out is, as usual, to specify the algebraic form of the trend and the number of cyclical components acting on it, to make stochastic assumptions, and to maximize the likelihood of the sample. The procedure is very laborious; it is generally biased, but efficient.
I think it represents the best we can ever do, and I am condemning the other methods only if they are pretentiously paraded as scientific. I do admit them as approximations to the ideal.

12.9. How not to analyze time series

The National Bureau of Economic Research has attracted a great deal of attention with its large-scale compilation and analysis of business cycle data. The compilation is done with such care, tenacity, and love as to earn the gratitude of all users of statistics. The analysis, however, has often been questioned. It proceeds roughly as follows:

1. Define a reference cycle for all economic activity. This is a conglomerate of the drift of several time series, accorded various degrees of importance.
2. Remove seasonal variations from the given series, say, carloadings or business failures.
3. Divide the given series into bands corresponding to the reference cycle.
4. Within each band express each January reading as a per cent of the average January in the band, and so on to December.
5. In each of the resulting specific cycles recognize nine typical positions or phases. The latter may be widely spaced, like an open accordion, in a long specific cycle or tightly in a short one.

The result is now considered to be the business cycle in carloadings, and constitutes the raw material for forecasting, for computing the irregular effects, and for checking whether the given series can be said to have its typical periodicity, amplitude, etc. There are variations of the procedure, some ad hoc. After what I said earlier in the chapter about the pitfalls of time series, I shall not make any further comment on the National Bureau's method.

Recently, electronic computations have been programmed, mainly for removing the seasonal.
As they involve the use of several layers of moving averages, they are not altogether safe in the hands of an analyst ungrounded in mathematical statistics; since, however, the seasonal is the least likely to cause harm (after all, the period is correct), we may set this question aside. (See Julius Shiskin, Electronic Computers and Business Indicators, Occasional Paper 57, New York: National Bureau of Economic Research, 1957.)

12.10. Several variables and time series

In Secs. 12.1 to 12.9 we have considered variables that move in time subject to shocks and to laws of motion unconnected with any other variables. It hardly needs stressing that endogenous economic variables are not of this kind, since all of them are generated jointly by the workings of the economic system. One wonders of what use is the analysis of individual time series despite the heavy apparatus of correlograms, periodograms, and variate differences.

Assuming that several economic variables hang together structurally, what kinds of time series do they manifest? Sections 12.11 and 12.12 discuss this problem. If several economic variables are unconnected, how does a given combination of them behave? The answer to this question (Sec. 12.13) provides a null hypothesis for judging the effectiveness of averages, sums, and a variety of business indicators, like the National Bureau of Economic Research "cyclical indicators" and "diffusion indexes" (Sec. 12.13).

The converse problems are also of great importance to the progress of business cycle research, because consideration of individual time series may enable us to infer the nature of the economic system without laboriously estimating each structural equation by the methods of Chaps. 1 to 9.

12.11. Time series generated by structural models

What kinds of time series are generated when the two variables x and y are structurally related?
We shall take up this question first for nonstochastic relations and then for stochastic relations under various simplifying assumptions. All our models will be complete.

If the model is completely nonlagged, like the usual skeleton business cycle model

$C = \alpha + \beta Y$
$Y = C + I$    (12-14)

with investment taken as exogenous, then the time series for consumption and income have the same shape as the series for investment, as can be seen from the reduced form:

$(1 - \beta) C = \alpha + \beta I$
$(1 - \beta) Y = \alpha + I$

In this example the agreement is not only in the timing of turns but in the phase as well, because investment, consumption, and income are positively related. In a more extended model

$C = \alpha + \beta Y$
$I = \gamma + \delta G$    (12-15)
$Y = C + I + G$

where investment is endogenous, government expenditure is exogenous, and investment is discouraged by the latter ($\delta < 0$), all time series will coincide on timing; but when $G$ grows $I$ falls, and $C$ and $Y$ will fall if $\delta$ is less than -1 and will rise if it is greater.

If (12-14) and (12-15) are made stochastic, all endogenous variables absorb some of the random disturbances. The random disturbances apportion themselves, one year with another, according to a fixed pattern among the endogenous variables. For instance, if $u$ is the random disturbance of the consumption function and $v$ of the investment function, the reduced form of (12-15):

$(1 - \beta) C = (\alpha + \beta\gamma) + (\beta + \beta\delta) G + u + \beta v$
$(1 - \beta) I = (\gamma - \beta\gamma) + (\delta - \beta\delta) G + (1 - \beta) v$    (12-16)
$(1 - \beta) Y = (\alpha + \gamma) + (1 + \delta) G + u + v$

shows that the fluctuations and irregular components are in step, though they may differ in their amplitudes. The variances of the three irregular components in (12-16) are proportional respectively to

$\sigma_{uu} + 2\beta\sigma_{uv} + \beta^2\sigma_{vv}$,  $(1 - \beta)^2\sigma_{vv}$,  and  $\sigma_{uu} + 2\sigma_{uv} + \sigma_{vv}$

Thus, if the two random terms are positively correlated ($\sigma_{uv} > 0$), income wobbles more than consumption and consumption more or less than investment, depending on the size of the marginal propensity to consume.
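A small numerical sketch may make the last statement concrete. It evaluates the three variance expressions attached to (12-16), divided by the common factor (1 - β)²; the parameter values are hypothetical and the function name is an illustrative choice.

```python
def reduced_form_variances(beta, s_uu, s_vv, s_uv):
    """Variances of the irregular components of C, I and Y in the
    reduced form (12-16), each divided by the common factor (1 - beta)^2."""
    scale = (1.0 - beta) ** 2
    var_c = (s_uu + 2.0 * beta * s_uv + beta ** 2 * s_vv) / scale
    var_i = ((1.0 - beta) ** 2 * s_vv) / scale   # reduces to s_vv
    var_y = (s_uu + 2.0 * s_uv + s_vv) / scale
    return var_c, var_i, var_y

# Hypothetical disturbance moments with positive covariance (s_uv > 0).
s_uu, s_vv, s_uv = 0.2, 1.0, 0.1

for beta in (0.1, 0.8):
    var_c, var_i, var_y = reduced_form_variances(beta, s_uu, s_vv, s_uv)
    print(f"beta={beta}: var C={var_c:.3f}  var I={var_i:.3f}  var Y={var_y:.3f}")
```

With these assumed moments, income is more variable than consumption at both values of β, while consumption is less variable than investment at the low marginal propensity to consume and more variable at the high one.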
Let us now consider as a recursive model the market for fish. The men go to their boats with today's price in their minds, expecting it to prevail tomorrow, and work hard if the price is high. Thus tomorrow's supply depends on today's price plus weather ($z$). Should the price fall, the fishermen don't put the fish back into the sea; so at the end of the day all the fish is sold. Demand is ruled by current price only.

$d_t = \alpha + \beta p_t + u_t$
$s_t = \gamma + \delta p_{t-1} + \epsilon z_t + v_t$    (12-17)
$s_t = d_t$

The model can be solved for $p$ as follows:

$\alpha + \beta p_t + u_t = \gamma + \delta p_{t-1} + \epsilon z_t + v_t$

which shows that price tends to zigzag ($\beta$ negative, $\delta$ positive), falling with good weather and rising with bad, as we might expect. In (12-17), unlike case (12-16), the irregular components of the price and quantity time series are no longer constant multiples of each other, nor are they in step. This is so because randomly overeager demand ($u > 0$) affects not only today's price but, through its effect on the fishermen's efforts, contributes to a fall in tomorrow's price as well.

The connections among phase, amplitude, and irregularity in structurally related time series become very complicated as we increase the number of variables and as we admit more and more lags and cross lags. In any representative set of economic time series it would indeed be a marvel if closely similar patterns emerged, except between such series as sales of left and of right shoes. And yet the marvel seems to happen.

12.12. The over-all autoregression of the economy

Regardless of which came first, chickens and eggs in the long run have similar time series, because there can be no chicken without a previous combination of egg and chicken and there can be no egg without a previous chicken.
Since the hatching capacity of a hen is fixed, say, 10 chicks per hen, and since the chicken-producing capacity of an egg is also fixed, say 1 to 1, the cycles in the egg population and in the hen population cannot possibly fail to exhibit a likeness, though in particular short-run instances random disturbances like Easter or a fox can grievously misshape now the one, now the other series.

Orcutt has claimed (the reference is in Further Readings at the end of the chapter) that something like this is true of the time series of the economy's endogenous variables. He states that the autoregressive relation

$x_{t+1} = 1.3 x_t - 0.3 x_{t-1} + u_{t+1}$

fairly describes the body of variables used by Tinbergen in his pioneering analysis of American business fluctuations (Jan Tinbergen, Statistical Testing of Business-Cycle Theories, Geneva: League of Nations, 1939).

Orcutt's result, if correct, would not exactly spell the end of structural estimation of econometric models, because the latter may be more efficient, less biased, etc. However, if a correct autoregression were discovered, it would certainly short-circuit a good deal of current research. Orcutt's theorem holds only for systems whose exact part, by itself, is stable and nonexplosive.

Orcutt also found that we can get better estimates of the over-all autoregression if we consider many time series simultaneously than if we consider them one at a time. This follows from the fact that Easter and foxes descend on eggs and hens independently, so that a grievous random dent in the egg population tends to be balanced by the relative regularity of the hen population.

In the absence of random shocks, all the interdependent variables have the same periodicity but different timing, amplitudes, and levels about which they fluctuate. With random shocks, the periodicities are destroyed more or less depending on the severity of the shocks and their incidence on particular variables. The unaided eye can seldom
recognize the true periodicity. A highly sophisticated technique can screen out the autoregressive structure by combining observations from all time series, but it is so difficult to compute that one might as well specify a model in the ordinary way.

Exercises

12.A Let $g(t)$ and $k(t)$ be the population of gnus and of kiwis. Let $\beta_i$ and $\delta_i$ be the age-specific birth and death rates for gnus and $\alpha_j$ and $\gamma_j$ for kiwis. Disregard the question of the sexes. Let $\epsilon$ and $\zeta$ stand for input-output coefficients expressing the necessary number of kiwis a gnu must eat to survive, and conversely. Construct a model of this ecological system. Do something analogous for new cars and used cars.

12.B In a Catholic region, say, Quebec, the greater the number of priests and nuns, other things being equal, the smaller the birth rate, because the clergy is celibate. But the more numerous the clergy, other things being equal, the higher the birth rate of the laity, because of much successful preaching against birth control. Construct an ecological model for such a population.

12.C The more people, the more lice, because lice live on people. But the more lice, the more diseases and, hence, the fewer people. Construct the model, with suitable life spans for the average louse and human.

12.D According to the beliefs of a primitive tribe, lice are good for one's health, because they can be observed only on healthy people. (Actually the lice depart from the sick person because they cannot stand his fever.) Construct this model and compare with Exercise 12.C.

12.13. Leading indicators

An economic indicator is a sensitive messenger or representative of other economic phenomena. We search for indicators in the same spirit in which pathology examines the tongue and measures the pulse: for quickness, cheapness, and to avoid cutting up the patient to find out what is wrong with him.

A timing indicator is a time series that typically leads, lags, or coincides with the business cycle.
Exactly what this means will occupy us later. We shall deal only with the leading indicators.

From what was said in Sec. 12.12, it comes as no surprise that certain economic time series, like Residential Building Contracts Awarded, New Orders for Durable Goods, and Average Weekly Hours in Manufacturing, should have a lead over Disposable Income, the Consumer Price Index, and so forth. The difficult questions are (1) how to insulate the cyclical components of each series from the trend, seasonal, and irregular; (2) how to tell whether leads in the sample period are genuine rather than the cumulation of random shocks; and (3) where phases are far apart, how to make sure that carloadings lead disposable income and not conversely, or that the Federal discount rate does lead and direct the money supply and not try belatedly to repair past mistakes. I am sure that, ultimately, one has to fall back on economic theory; one is forced to specify bits and pieces of any autoregressive econometric model, because no amount of mechanical screening of the time series themselves can answer the third question convincingly.

In 30 years of research the National Bureau of Economic Research has isolated about a dozen fairly satisfactory leading indicators out of 800-odd time series.¹ I think, however, that in nearly all cases, a priori considerations would have led to the selection of these leading series without the laborious wholesale analysis of hundreds and hundreds of time series. For instance, Average Hours Worked in Manufacturing is a good candidate for leading indicator of manufacturing activity because we know from independent observation that it is easier for a business establishment to take care of a moderate increase in orders by overtime than by hiring new workers and easier to tide over a lull by putting its workers on short time than by laying some off at the risk of losing them.
All the sensible leading indicators thrown up in the National Bureau's screening are obvious in a similar way. An oddity like the production of animal tallow, which is said to lead better than many other series, could not have been discovered by a priori reasoning, but neither is it used by any sane forecaster, for a good empirical fit is no substitute for a sound reason.

Part of the findings of the National Bureau are, I think, tautological, because the timing indicators lead, lag, and coincide not with each other individually, but with the reference cycle, which is an index of "general business activity." The latter is a vague conglomerate of employment, production, price behavior, and monetary and stock market activity; therefore, it is no wonder at all that some series lead, some coincide with, and others lag behind it. The reference cycle is a useful summary, but we should not be misled into existential fallacies about it.

¹ See Geoffrey H. Moore, Statistical Indicators of Cyclical Revivals and Recessions (Occasional Paper 31, New York: National Bureau of Economic Research, 1950), particularly chap. 7 and appendix B.

12.14. The diffusion index

A diffusion index is a number stating how many out of a given set of time series are expanding from month to month (or any other interval). Diffusion indexes can be constructed from any set of series whatsoever and according to a variety of formulas, of which I shall discuss just three.

There are two reasons why one might want to construct a diffusion index. One is the belief that a business cycle starts in some corner of the economic system and propagates itself on the surrounding territory like a forest fire. This says in effect that the diffusion index is a cheap short-cut autoregressive econometric model. The second reason is that the particular formula used to construct the index captures in a handy way the logic of economic behavior.
Three different formulas have been suggested for the diffusion index:

Formula A  Per cent of the series expanding
Formula B  Per cent of the series reaching turns
Formula C  Average number of months the series have been expanding

Research by exhaustion argues that we ought to try all these formulas on all time series and choose the formula that gives pragmatically the best results. This can be done quite cheaply on the Univac. I think such a procedure will frustrate our search for good indicators, because each formula embodies a different theory of economic behavior, not universally suitable.

Formula A is justified by the classical type of business cycle, where income, employment, prices, hours, inventories, production, and so on, and their components move up and down in rough agreement or with characteristic lags. Suppose, however, that the authorities control totals: employment, some price index, credit, or the balance of payments. The result is "rolling readjustment" rather than cycles. Formula A has lost its relevance. In a world of rolling readjustment this formula will show an uneventful record and will not be able to indicate, much less predict, sectional crises hiding under a calm total.

Formula B is justified if consumers and business are more sensitive to turns, however mild, than to accelerations, however violent. Investment plans are likely to be of this kind. As long as there is expansion in demand, any overexpansion will be made good eventually. If there is contraction, however small, the mistake is more obvious, and panic may easily result. On the other hand, there are many areas in both the consumer and the business sectors where small turns are not taken seriously. Formula B, therefore, can be used to best advantage in studying certain investment series (like railways' orders of rolling stock) but is counterprescribed elsewhere.

Formula C gives great emphasis to reversals.
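The three formulas can be made concrete with a minimal sketch. The two series are hypothetical, and several conventions are assumptions made here rather than definitions given in the text: a "turn" is taken to be a month whose change has the opposite sign from the preceding month's change, and formula C is computed as the average of signed run lengths (1, 2, 3, . . . while a series expands, -1, -2, . . . while it contracts, 0 for an unchanged month).

```python
def changes(series):
    """Month-to-month changes."""
    return [b - a for a, b in zip(series, series[1:])]

def sign(x):
    return (x > 0) - (x < 0)

def formula_a(series_set):
    """Per cent of the series expanding in each month."""
    moves = [changes(s) for s in series_set]
    n = len(series_set)
    return [100.0 * sum(1 for m in moves if m[t] > 0) / n
            for t in range(len(moves[0]))]

def formula_b(series_set):
    """Per cent of the series reaching a turn (assumed: a reversal in the
    sign of the month-to-month change)."""
    moves = [changes(s) for s in series_set]
    n = len(series_set)
    return [100.0 * sum(1 for m in moves if sign(m[t]) * sign(m[t - 1]) < 0) / n
            for t in range(1, len(moves[0]))]

def signed_runs(series):
    """Months of continuous expansion (positive) or contraction (negative)."""
    runs, run = [], 0
    for c in changes(series):
        if c > 0:
            run = run + 1 if run > 0 else 1
        elif c < 0:
            run = run - 1 if run < 0 else -1
        else:
            run = 0   # assumption: an unchanged month resets the count
        runs.append(run)
    return runs

def formula_c(series_set):
    """Average signed run length across the series."""
    runs = [signed_runs(s) for s in series_set]
    n = len(series_set)
    return [sum(r[t] for r in runs) / n for t in range(len(runs[0]))]

# Two hypothetical component series (not the data of the exercises).
series_set = [
    [100, 101, 103, 104, 102, 101, 103],
    [50, 52, 51, 53, 54, 55, 54],
]
print("Formula A:", formula_a(series_set))
print("Formula B:", formula_b(series_set))
print("Formula C:", formula_c(series_set))
```

On these data formula A gives 100, 50, 100, 50, 50, 50; formula B gives 50, 50, 50, 0, 100; and formula C gives 1, 0.5, 2, 0.5, 0.5, 0. The three records differ markedly although they summarize the same two series, which is the point of treating each formula as a separate theory of behavior.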
Take a component that has been slowly expanding for some months, then turns down briefly. Formula C registers (for this component) 1, 2, 3, etc., up to a large positive number, then -1 (for the first month of contraction). The more sustained the expansion, the more violently does the formula register a halt or small reversal. This formula, then, is appropriate where habit and momentum play an important part. Where could we possibly want to apply it? Hire-purchase may be related to disposable income in some way that agrees with the logic of formula C. Suppose that small increases in income go into down payments and time payments for more and more gadgets; if so, a small fall in income would put a complete stop to new hire-purchase, because the family would continue its contractual time payments on the old gadgets and would not be likely to cut into food, clothing, and recreation to buy new gadgets. This is a theory of consumer behavior, and formula C is a convenient way to express it short of an econometric equation.

Exercises

12.E Construct diffusion indexes by each formula from the two time series below:

Series 1  100, 96, 90, 96, 97, 95, 97
Series 2  100, 99, 102, 100, 100, 101, 10?

and compare the cyclical behavior of the indexes with that of the sum of the two series.

12.F In Exercise 12.E, series 2, replace the 99 by 101, and construct formula A. Must turning points in the sum be preceded by turning points in the index?

12.G Construct an example to show that an index according to formula C can be completely insensitive to the sum of the component series.

12.H Show by example the converse of Exercise 12.G, namely, that swings in the diffusion index formula C need not herald turns (or any change whatsoever) in the sum of the component series.

12.15.
Abuse of long-term series

One unfortunate by-product of time series analysis is that it requires long time series with which to work, and several research organizations have responded enthusiastically to the challenge. For example, I have heard urgings that we construct Canadian historical statistics for the purpose of sorting out timing indicators, on the ground that what took the National Bureau 30 years can now be done in 30 hours electronically. I think this kind of work quite futile, for a few moments' reflection will convince us of its negative results.

The Canadian economy, compared with the American, is small and relatively unbalanced; therefore, Canadian historical statistics will have a very large irregular component, which will overwhelm the fine structural relationships we want to uncover.

The Canadian economy, being open, responds to impulses from abroad; therefore, even if we had good domestic historical time series, our chances of finding among them good indicators are slim.

We also know that the Canadian economy is "administered" (it has more governments per capita than we have and more industrial concentration); so the developments that are foreshadowed by the indicators are likely to be anticipated by the big policy makers, with the result that predictions go foul.

We know that Canada is and will be growing fast and that the past (on which all indicators rely) will not be a dependable guide.

My guess is that the earliest useful year for time series on bread baking is somewhere around 1920. For iron ore shipments it is 1947, the year when certain Great Lakes canals were deepened. However, for housing demand as a function of family formation, many decades or even centuries might prove to contain valid information.

There are many good reasons why we might want to construct uniformly long historical statistics, but the needs of cycle forecasting are certainly not among them.

12.16.
Abuse of coverage

An unfortunate by-product of diffusion index analysis is that it encourages the construction of complete sets of data when incomplete ones would be more satisfactory. This is so because the timing and irregular features of the diffusion index change with the number of series included in it.

Let us suppose that we want to forecast industrial production by means of average weekly hours worked; the series rationalized in Sec. 12.13 is a possible leading indicator. If hours worked come broken down by industry, we suspect we might do better if we use a diffusion index of the basic series rather than the over-all average.

Now our first impulse is to look at the published series for Hours Worked and make sure that they give complete coverage by industry and by locality and that the series have no gaps in time. After all, we want to forecast for all industry and for the entire country. Yet it is unreasonable to desire full coverage.

First, some industries employ labor as a fixed, not a variable, input. A generating station, if it is operated at all, is tended by a switchman 24 hours a day, regardless of its output. Labor is uncorrelated with output. Here is a case where coverage does harm to our forecast, because it introduces two uncorrelated variables on each side of the scatter diagram, so to speak.

Second, in the service industries, the physical measure of output is labor input, because this is how the compilers of government statistics measure the production of services. If we insist on coverage of the services, we get trivial correlations, not good forecasts.

Third, during retooling it is possible to have long working hours and no industrial production. What should we do? Throw out entirely any industry that has retooling periods? Not at all. It is enough to suppress temporarily from consideration the data for this industry until the experts tell us that retooling and catching up on backlog are
over. A deliberate time gap in the statistics improves them. This method, though it appears to be wasting information, actually uses more, for it includes the fact that there has been a retooling period. The fact that the diffusion index is dehomogenized is a flaw of a second order of importance.

In this way we select statistics of average hours worked to use in forecasting industrial production which are a statistician's nightmare: they have time gaps, they are unrepresentative, and they do not reconcile with national accounts Labor Income when they are multiplied by an average of wage rates.

Similarly, for statistics that are most useful in forecasting, it is not necessary that they be classifiable into grand schemes, such as the National Income, Moneyflows, or Input-Output Tables. The Canadians plan to start compiling data on Lines of Credit agreed upon by chartered banks and their customers but not yet credited to the customer's account. Such a series, I think, will prove a better predictor than the present one, Business Loans. Now, if we had information on Lines of Credit, it would not fit any existing global scheme and would not become any more useful if it did. For forecasting purposes, I see no excuse for creating a matrix of Inter-sector Contingent Liabilities or for constructing a Balance of Withdrawable Promises account.

12.17. Disagreements between cross-section and time series estimates

It is very puzzling to find that careful studies of the consumption function derived from time series give a significantly larger value for the marginal propensity to consume than equally competent studies of cross-section data. Three kinds of explanations are available: (1) algebraic and (2) statistical properties of the model explain the discrepancy; and (3) cross-section data and time series data measure different kinds of behavior.
We shall concentrate on explanations 1 and 2 in order to show that algebra and statistics alone account for much of the difference and that to this extent explanations of the third category are redundant.

Cross-section data are figures of income, consumption, etc., by individual families in a given fixed time period. Time series are data about a given family's consumption and income through time or about national consumption and income through time.

Algebraic differences

The shape of the consumption function can breed differences. If the family consumption function is nonlinear, say,

$c = \alpha + \beta y + \gamma y^2 + u$    (12-18)

then the consumption function connecting average income av $y$ and average consumption av $c$, or total income $Y$ and total consumption $C$, will look different from equation (12-18), even if all families have the same consumption function and if the distribution of income remains constant. To see this, take just two families,

$c_1 = \alpha + \beta y_1 + \gamma (y_1)^2 + u_1$
$c_2 = \alpha + \beta y_2 + \gamma (y_2)^2 + u_2$

add together and divide by 2 to get

$\mathrm{av}\, c = \alpha + \beta\, \mathrm{av}\, y + 2\gamma (\mathrm{av}\, y)^2 - \gamma y_1 y_2 + \mathrm{av}\, u$    (12-19)

and, in general, with $N$ individuals,

$\mathrm{av}\, c = \alpha + \beta\, \mathrm{av}\, y + N\gamma (\mathrm{av}\, y)^2 - \frac{\gamma}{N} \sum_{i \neq j} y_i y_j + \mathrm{av}\, u$    (12-20)

One might argue that, when income distribution remains unchanged, the cross term $\sum y_i y_j$ remains constant and is absorbed into the estimate of $\alpha$. But this is false, because the cross term appears in (12-20) multiplied by $\gamma$, another unknown parameter, whose estimate is bound up with the estimates of $\alpha$ and $\beta$ in the least squares (or other) formulas. The discrepancy between (12-20) and (12-18) affects the estimates of all three parameters $\alpha$, $\beta$, and $\gamma$ if no allowance is made for the extra terms of the average consumption function. The two terms in $\gamma$ in (12-20) are together equal to

$\frac{\gamma}{N} \sum_{i} y_i^2$

that is to say, they involve the raw moment $m'_{yy}$ of the family incomes.
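The aggregation algebra can be checked numerically with a short sketch. The parameter values, incomes, and disturbances below are hypothetical, and the cross term is taken over all ordered pairs with i different from j, divided by N, which is the form that agrees with the two-family case (12-19).

```python
def average(xs):
    return sum(xs) / len(xs)

def family_consumption(y, u, alpha, beta, gamma):
    """Family consumption function (12-18)."""
    return alpha + beta * y + gamma * y * y + u

def aggregate_rhs(incomes, shocks, alpha, beta, gamma):
    """Right-hand side of the average consumption function (12-20)."""
    n = len(incomes)
    av_y = average(incomes)
    cross = sum(incomes[i] * incomes[j]
                for i in range(n) for j in range(n) if i != j)
    return (alpha + beta * av_y + n * gamma * av_y ** 2
            - gamma * cross / n + average(shocks))

# Hypothetical parameters and three hypothetical families.
alpha, beta, gamma = 10.0, 0.8, -0.001
incomes = [100.0, 200.0, 400.0]
shocks = [1.0, -2.0, 1.0]

av_c = average([family_consumption(y, u, alpha, beta, gamma)
                for y, u in zip(incomes, shocks)])
with_cross_term = aggregate_rhs(incomes, shocks, alpha, beta, gamma)
naive = alpha + beta * average(incomes) + gamma * average(incomes) ** 2

print("average consumption:             ", round(av_c, 4))
print("(12-20) with the cross term:     ", round(with_cross_term, 4))
print("family function at average income:", round(naive, 4))
```

The first two figures agree, while simply evaluating the family function at average income gives a visibly different number; the gap is the contribution of income inequality.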
It follows that, for time series and cross-section studies to give agreeing results, the average (or the total) consumption function must contain a term $m'_{yy}$ expressing inequality of income, even if this inequality should remain unchanged from year to year. Moreover, neither the sample variance of av $y$ nor the Pareto index is suitable for the correction in question.

If income distribution varies with time, to get the two approaches to agree our correction must be more elaborate, because the factor $\sum y_i y_j$ must be calculated anew for each year of the data. If we have no complete census of all families, a sample estimate of $\sum y_i y_j$ will be better than nothing. If a census of families exists but we are in a hurry, again we can approximate $\sum y_i y_j$ to any desired degree by taking the families in large income strata.

Statistical differences

Let us assume that the consumption function of a family is linear and constant over time and that it involves another variable $x$, reflecting some circumstance of the family, like age.

$c = \alpha + \beta y + \gamma x + u$    (12-21)

However, let the characteristic $x$, as time passes, have a constant distribution among the several families. For example, in a stationary population, the ages of the totality of families remain unchanged, although the age of any given family always increases. If we aggregate (12-21) we get

$\mathrm{av}\, c = \alpha + \beta\, \mathrm{av}\, y + \gamma\, \mathrm{av}\, x + \mathrm{av}\, u$    (12-22)

but av $x$, being a constant, is absorbed into the constant term when we estimate (12-22). Not so if we trace the history of one such family by estimating (12-21) from time series.

In practice there is a further complication: the characteristic $x$ is not independent of the family's income; thus, $\hat{\beta}$ and $\hat{\gamma}$ are shaky estimates in (12-21) because of multicollinearity. This is an additional reason why time series and cross sections disagree.

Thus, we do not need to go so far afield as to postulate several kinds of consumption functions (long-term, short-term) to explain these discrepancies.
If, after we have corrected for the algebraic and statistical sources of discrepancy, some further disagreement remains unexplained, that is the time for additional theories.

Further readings

Kendall, vol. 2, devotes two lucid chapters to the algebra and statistics of univariate time series. The proof that ignoring the serial correlation of the random term in a single equation leaves least squares estimates unbiased and consistent can be found in F. N. David and J. Neyman, "Extension of the Markoff Theorem on Least Squares" (Statistical Research Memoirs, vol. 2, pp. 105-116, December, 1938). How to treat serial correlation is discussed by D. Cochrane and G. H. Orcutt, "Application of Least Squares Regression to Relationships Containing Auto-correlated Error Terms" (Journal of the American Statistical Association, vol. 44, no. 245, pp. 32-61, March, 1949).

Eugen Slutsky, "The Summation of Random Causes as the Source of Cyclical Processes" (Econometrica, vol. 5, no. 2, pp. 105-146, April, 1937), is rightly famous for its contribution to theory and its interesting experimental examples with random series drawn from a Soviet government lottery.

Correlogram and periodogram shapes are discussed in Kendall, vol. 2, chap. 30. The brief discussion of autocorrelation, with examples, in Beach, pp. 17&-180, is simple and useful. The early article by Edwin B. Wilson, "The Periodogram of American Business Activity" (Quarterly Journal of Economics, vol. 48, no. 3, pp. 375-417, May, 1934), is both ambitious and sophisticated.

Tjalling C. Koopmans, in his review, entitled "Measurement without Theory," of Arthur F. Burns and Wesley C. Mitchell's Measuring Business Cycles (Review of Economic Statistics, vol. 29, no. 3, pp. 161-172, August, 1947), delivers a classic and definitive criticism of some investigators' avoidance of explicit assumptions. All would-be chartists should read it. Koopmans also gives, on p.
163, a summary account of the National Bureau method for isolating cycles.

J. Wise, in "Regression Analysis of Relationships between Autocorrelated Time Series" (Journal of the Royal Statistical Society, ser. B, vol. 18, no. 2, pp. 240-256, 1956), shows that, in recursive systems of two or more equations, least squares is biased both when the random terms of the separate equations are interdependent and when the random term of either equation is serially correlated.

The reference of Sec. 12.12 is G. H. Orcutt, "A Study of the Autoregressive Nature of the Time Series Used for Tinbergen's Model of the Economic System of the United States 1919-1932," with discussion (Journal of the Royal Statistical Society, ser. B, vol. 10, no. 1, pp. 1-53, 1948). Arthur J. Gartaganis, "Autoregression in the United States Economy, 1870-1929" (Econometrica, vol. 22, no. 2, pp. 228-243, April, 1954), uses much longer time series and concludes that the over-all autoregressive structure changed drastically around the year 1913. Gartaganis uses six lags.

I have discussed the mathematical properties of the diffusion index in "Must the Diffusion Index Lead?" (American Statistician, vol. 11, no. 4, pp. 12-17, October, 1957). Geoffrey Moore's comments are on pp. 16-17.

Trygve Haavelmo, "Family Expenditures and the Marginal Propensity to Consume" (Econometrica, vol. 15, no. 4, pp. 335-341, October, 1947), reprinted as Cowles Commission Paper 26, affords a good exercise in the decoding of compact econometric argument. Haavelmo deals with the discrepancies arising from different ways of measuring the consumption function.

APPENDIX A

Layout of computations

I recommend a standard layout, no matter how large or small the model or what estimating procedure one plans to use (least squares, maximum likelihood, limited information) or what simplifying assumptions one has made.

There are three general rules to follow:

1.
Scale to avoid large rounding errors and to detect other errors more easily. Scaling should be applied in two stages.
   a. Scale the variables.
   b. Scale the moments.
2. Use check sums.
3. Compute all the basic moments. This may seem redundant, but is actually very efficient if one wants
   a. To compute correlations.
   b. To experiment with alternative models.
   c. To get least squares first approximations.
   d. To select the best instrumental variables.

The rules in detail

Stage 1

Scale the variables. Express all of them in units of measurement (say, cents, tens of dollars, thousands, billions, etc.) that reduce all the variables to comparable magnitudes. Scale the units so as to bring the variables (or most of them) into the range from 0 to 1. For instance:

National income: $x_1 = 0.475$ trillion dollars
Hourly wage rate: $x_2 = 0.182$ tens of dollars
Population: $x_3 = 0.165$ billions
Price of platinum: $x_4 = 0.945$ hundreds of dollars per ounce

This, rather than the range 1 to 10 or 10 to 100, is preferred, because we shall include an auxiliary variable identically equal to 1. Then all variables, regular and auxiliary, are of the same order of magnitude.

Stage 2

Arrange the raw observations as in Table A.1. Note that the endogenous variables, the $y$'s, are followed by their check sum $Y$ and that, in addition to all the exogenous variables $z_1, z_2, \ldots, z_{H-1}$, we devote a column to the constant number 1, which is defined as the last exogenous variable $z_H$. These are then followed by the check sum $Z$ of the exogenous variables including $z_H = 1$ and by a grand sum $X = Y + Z$.

Stage 3

The raw moment of variable $p$ on variable $q$ is defined as

$$m'_{pq} = \sum p\,q$$

where the sum is over the sample. A raw moment is not the same thing as the simple moment $m_{pq}$ defined in the Digression of Sec. 1.2. The simple moment $m_{pq}$ is also called the (augmented) moment from the mean of variable $p$ on variable $q$.

Compute the raw moments of all variables on all variables.
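The arrangement of Stage 2 and the raw moments of Stage 3 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed routine: the four observations are invented, the names `columns`, `raw_moment`, and `m_raw` are hypothetical, and the raw moment is taken to be the plain sum of products over the sample. The loop at the end verifies that the check sums $Y$, $Z$, and $X$ carry over from the observations to the moments.

```python
from math import isclose

# Hypothetical scaled observations (values invented for illustration):
# two endogenous variables y1, y2 and one regular exogenous variable z1.
y1 = [0.41, 0.45, 0.48, 0.52]
y2 = [0.18, 0.17, 0.19, 0.21]
z1 = [0.16, 0.17, 0.17, 0.18]
S = len(y1)

# Stage 2: append the check sum Y, the auxiliary variable 1,
# the check sum Z, and the grand sum X = Y + Z.
columns = {"y1": y1, "y2": y2}
columns["Y"] = [a + b for a, b in zip(y1, y2)]
columns["z1"] = z1
columns["1"] = [1.0] * S
columns["Z"] = [a + b for a, b in zip(z1, columns["1"])]
columns["X"] = [a + b for a, b in zip(columns["Y"], columns["Z"])]

def raw_moment(p, q):
    """Raw moment of p on q: the sum over the sample of the products p*q."""
    return sum(a * b for a, b in zip(columns[p], columns[q]))

# Stage 3: raw moments of all variables on all variables.
names = list(columns)
m_raw = {(p, q): raw_moment(p, q) for p in names for q in names}

# Check sums: in every row the Y entry equals the sum of the y entries,
# the Z entry equals the sum of the z entries (auxiliary 1 included),
# and the X entry equals the Y entry plus the Z entry.
for p in names:
    assert isclose(m_raw[p, "Y"], m_raw[p, "y1"] + m_raw[p, "y2"])
    assert isclose(m_raw[p, "Z"], m_raw[p, "z1"] + m_raw[p, "1"])
    assert isclose(m_raw[p, "X"], m_raw[p, "Y"] + m_raw[p, "Z"])

print(m_raw["1", "1"])  # the raw moment of the auxiliary variable on itself is S
```

A failed assertion points to the row in which a product or a sum was computed wrongly, which is the purpose of carrying the check sums along.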
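This gives the symmetrical matrix $\mathbf m'$ of moments, shown in Table A.2. In Table A.2 the symbol $m'$ is omitted, and only the subscripts appear; for instance, $y_G y_1$ stands for $m'_{y_G y_1}$.

Table A.1 Arrangement of raw observations

| Time | Endogenous variables | Check sum | Exogenous variables, regular | Exogenous variable, aux. | Check sum | Grand sum |
|---|---|---|---|---|---|---|
| 1 | $y_1(1)\ \cdots\ y_G(1)$ | $Y(1)$ | $z_1(1)\ \cdots\ z_{H-1}(1)$ | 1 | $Z(1)$ | $X(1)$ |
| 2 | $y_1(2)\ \cdots\ y_G(2)$ | $Y(2)$ | $z_1(2)\ \cdots\ z_{H-1}(2)$ | 1 | $Z(2)$ | $X(2)$ |
| 3 | $y_1(3)\ \cdots\ y_G(3)$ | $Y(3)$ | $z_1(3)\ \cdots\ z_{H-1}(3)$ | 1 | $Z(3)$ | $X(3)$ |
| ... | ... | ... | ... | ... | ... | ... |
| S | $y_1(S)\ \cdots\ y_G(S)$ | $Y(S)$ | $z_1(S)\ \cdots\ z_{H-1}(S)$ | 1 | $Z(S)$ | $X(S)$ |

Table A.2

$$
\begin{array}{cccccccccccc}
 & & & & & & & & & \downarrow & & \\
 & y_1y_1 & \cdots & y_1y_G & y_1Y & y_1z_1 & y_1z_2 & \cdots & y_1z_{H-1} & y_1\cdot 1 & y_1Z & y_1X \\
 & \vdots & & \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & \vdots \\
 & y_Gy_1 & \cdots & y_Gy_G & y_GY & y_Gz_1 & y_Gz_2 & \cdots & y_Gz_{H-1} & y_G\cdot 1 & y_GZ & y_GX \\
 & Yy_1 & \cdots & Yy_G & YY & Yz_1 & Yz_2 & \cdots & Yz_{H-1} & Y\cdot 1 & YZ & YX \\
 & z_1y_1 & \cdots & z_1y_G & z_1Y & z_1z_1 & z_1z_2 & \cdots & z_1z_{H-1} & z_1\cdot 1 & z_1Z & z_1X \\
 & z_2y_1 & \cdots & z_2y_G & z_2Y & z_2z_1 & z_2z_2 & \cdots & z_2z_{H-1} & z_2\cdot 1 & z_2Z & z_2X \\
 & \vdots & & \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & \vdots \\
 & z_{H-1}y_1 & \cdots & z_{H-1}y_G & z_{H-1}Y & z_{H-1}z_1 & z_{H-1}z_2 & \cdots & z_{H-1}z_{H-1} & z_{H-1}\cdot 1 & z_{H-1}Z & z_{H-1}X \\
\rightarrow & 1\cdot y_1 & \cdots & 1\cdot y_G & 1\cdot Y & 1\cdot z_1 & 1\cdot z_2 & \cdots & 1\cdot z_{H-1} & 1\cdot 1 & 1\cdot Z & 1\cdot X \\
 & Zy_1 & \cdots & Zy_G & ZY & Zz_1 & Zz_2 & \cdots & Zz_{H-1} & Z\cdot 1 & ZZ & ZX \\
 & Xy_1 & \cdots & Xy_G & XY & Xz_1 & Xz_2 & \cdots & Xz_{H-1} & X\cdot 1 & XZ & XX
\end{array}
$$

Stage 4

Compute the augmented moments from the mean of each variable (except $z_H = 1$) on each variable, e.g.,

$$m_{x_1x_2} = S\,m'_{x_1x_2} - m'_{x_1\cdot 1}\,m'_{x_2\cdot 1}$$

This is done very easily because $m'_{x_1\cdot 1}$ and $m'_{x_2\cdot 1}$ are always on the level indicated by the arrows and in the row and column corresponding to $m'_{x_1x_2}$.

This procedure gives a square symmetric matrix $\mathbf m$ of moments from the mean. The new matrix contains one row and one column less than the matrix $\mathbf m'$.

Stage 5

Rule for check sums. In both $\mathbf m$ and $\mathbf m'$ any entry containing a capital $Y$ (or $Z$) is equal to the sum of all entries in its row that contain lower-case $y$'s (or $z$'s). Any entry containing a capital $X$ is equal to the sum of the $Y$ and $Z$ entries in its row, that is, to the sum of all the lower-case entries that precede it in the row.

The same rules hold in the vertical direction, since the matrices $\mathbf m'$ and $\mathbf m$ are symmetric.

Stage 6

Scale the moments. This step is not always possible. Scan the symmetric matrix $\mathbf m$.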
If it contains any row (hence, column) of entries all (or nearly all) of which are very large or very small relative to the rest of the rows and columns, divide or multiply the entire offending row and column by an appropriate power of 10. The purpose is to make the matrix $\mathbf m$ contain entries as nearly equal as possible.

When moments have comparable magnitudes, matrix operations on them are very accurate, rounding errors are small, and calculating errors can be readily detected.

Keep accurate track of the variables that have been scaled up or down in stages 1 and 6 and of how many powers of 10 in each stage and altogether.

Stage 7

Coefficients of correlation. These can be computed very easily from $\mathbf m$, but unfortunately the checks do not work in this case. So drop the check sums and consider only part of $\mathbf m$. The sample correlation coefficient between, say, the variables $y_g$ and $z_h$ is

$$r_{y_g z_h} = \frac{m_{y_g z_h}}{\sqrt{m_{y_g y_g}\,m_{z_h z_h}}}$$

Coefficients of correlation are used informally to screen out the most promising models (see Chap. 10 on bunch maps).

Matrix inversion

This is a frequent operation in estimating medium and large systems. Details for computing $\mathbf M^{-1}$ and $\mathbf M^{-1}\mathbf N$ are given in Klein, pp. 151ff.

There are various clever devices for inverting a matrix and performing the operation $\mathbf M^{-1}\mathbf N$. Electronic computers have standard programs, and it is well to use them if they are available. If $\mathbf M$ is small in size and if both $\mathbf M^{-1}$ and $\mathbf M^{-1}\mathbf N$ are wanted, do the following: Write side by side the matrix $\mathbf M$, the unit matrix of the same size, and then $\mathbf N$.

$$[\mathbf M][\mathbf I][\mathbf N]$$

Then perform linear combinations on the rows of the entire new matrix $[\mathbf M\ \mathbf I\ \mathbf N]$ in such a way as to reduce $\mathbf M$ to a unit matrix. When you have finished, you will have obtained

$$[\mathbf I][\mathbf M^{-1}][\mathbf M^{-1}\mathbf N]$$

For example, let

$$\mathbf M = \begin{bmatrix} 2 & 4 \\ 1 & 6 \end{bmatrix} \qquad \mathbf N = \begin{bmatrix} 6 & 30 & 50 \\ 2 & 1 & 3 \end{bmatrix}$$

We shall trace the evolution of $[\mathbf M][\mathbf I][\mathbf N]$ into $[\mathbf I][\mathbf M^{-1}][\mathbf M^{-1}\mathbf N]$.

$$(\mathbf{MIN})_0 = \left[\begin{array}{cc|cc|ccc} 2 & 4 & 1 & 0 & 6 & 30 & 50 \\ 1 & 6 & 0 & 1 & 2 & 1 & 3 \end{array}\right]$$

Divide the first row by 2.
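$$(\mathbf{MIN})_1 = \left[\begin{array}{cc|cc|ccc} 1 & 2 & \tfrac12 & 0 & 3 & 15 & 25 \\ 1 & 6 & 0 & 1 & 2 & 1 & 3 \end{array}\right]$$

In this new matrix, subtract row 1 from row 2.

$$(\mathbf{MIN})_2 = \left[\begin{array}{cc|cc|ccc} 1 & 2 & \tfrac12 & 0 & 3 & 15 & 25 \\ 0 & 4 & -\tfrac12 & 1 & -1 & -14 & -22 \end{array}\right]$$

Divide the new second row by 2.

$$(\mathbf{MIN})_3 = \left[\begin{array}{cc|cc|ccc} 1 & 2 & \tfrac12 & 0 & 3 & 15 & 25 \\ 0 & 2 & -\tfrac14 & \tfrac12 & -\tfrac12 & -7 & -11 \end{array}\right]$$

Subtract the new second row from the first.

$$(\mathbf{MIN})_4 = \left[\begin{array}{cc|cc|ccc} 1 & 0 & \tfrac34 & -\tfrac12 & 3\tfrac12 & 22 & 36 \\ 0 & 2 & -\tfrac14 & \tfrac12 & -\tfrac12 & -7 & -11 \end{array}\right]$$

Divide the new second row by 2.

$$(\mathbf{MIN})_5 = \left[\begin{array}{cc|cc|ccc} 1 & 0 & \tfrac34 & -\tfrac12 & 3\tfrac12 & 22 & 36 \\ 0 & 1 & -\tfrac18 & \tfrac14 & -\tfrac14 & -3\tfrac12 & -5\tfrac12 \end{array}\right]$$

The three blocks of $(\mathbf{MIN})_5$ are, from left to right, $[\mathbf I]$, $[\mathbf M^{-1}]$, and $[\mathbf M^{-1}\mathbf N]$.

One can compute a string of "quotients" $\mathbf M^{-1}\mathbf N$, $\mathbf M^{-1}\mathbf P$, $\mathbf M^{-1}\mathbf Q$, etc., by tacking on $\mathbf N$, $\mathbf P$, $\mathbf Q$ and performing the linear manipulations.

This technique works in principle for matrices of any size, but with more than three or four rows it consumes a lot of paper and time.

APPENDIX B

Stepwise least squares

Estimating the parameters of

$$y = \gamma_0 + \gamma_1 z_1 + \gamma_2 z_2 + \gamma_3 z_3 + \cdots + \gamma_H z_H + u$$

by desk calculator, according to Cramer's rule, or by matrix inversion is a formidable task when $H$ is greater than 3. The stepwise procedure about to be explained may be slow, but it has three advantages over the other methods:

1. It can be stopped validly at any stage.
2. It possesses excellent control over rounding and computational errors.
3. We do not have to commit ourselves, ahead of time and once and for all, on how many decimal places to carry in the course of computations, but we may rather carry progressively more as the estimates are successively refined.

I shall illustrate the method by the simple case

$$w = \alpha x + \beta y + \gamma z + u$$

where (in violation of the usual conventions) $w$ is endogenous, and $x$, $y$, $z$ are exogenous. For the sake of illustration let us assume that, in the sample we happen to have drawn, the exogenous variables are "slightly" intercorrelated, so that $m_{xy}$, $m_{xz}$, $m_{xu}$, $m_{yz}$, $m_{yu}$, $m_{zu}$ are small numbers, although, of course, $m_{xx} = m_{yy} = m_{zz} = 1$.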
Step 1

On the basis of a priori information, arrange the exogenous variables from the most significant to the least significant, that is to say, according to the imagined size of the parameters $\alpha$, $\beta$, $\gamma$, disregarding their signs:

$$|\alpha| > |\beta| > |\gamma|$$

Step 2

To estimate a first approximation to $\alpha$, compute

$$\hat\alpha_1 = m_{wx}/m_{xx}$$

as an approximation to the true value. Let $\hat\alpha_1 = \alpha + A_1$.

Step 3

Form a new variable

$$v = w - \hat\alpha_1 x$$

and estimate a first approximation to $\beta$ by computing

$$\hat\beta_1 = \frac{m_{vy}}{m_{yy}}$$

Step 4

Form the new variable

$$s = v - \hat\beta_1 y$$

and then compute the first approximation to $\gamma$:

$$\hat\gamma_1 = \frac{m_{sz}}{m_{zz}}$$

Step 5

Form the new variable

$$w_1 = s - \hat\gamma_1 z$$

and compute

$$\hat A_1 = -\frac{m_{w_1x}}{m_{xx}}$$

The idea here is to estimate the error $A_1$ of our first approximation $\hat\alpha_1$. Compute

$$\hat\alpha_2 = \hat\alpha_1 - \hat A_1$$

as a second approximation to $\alpha$.

Step 6

Now that a better estimate of $\alpha$ is available, there is no point in correcting the first approximations $\hat\beta_1$, $\hat\gamma_1$. We discard them and attempt to get new approximations $\hat\beta_2$, $\hat\gamma_2$, based on the better estimate $\hat\alpha_2$. We first use $\hat\alpha_2$ to define a new variable

$$v_1 = w - \hat\alpha_2 x$$

Note that in this step we adjust the original variable $w$ (not $w_1$). Proceed now as in steps 3 to 5:

$$\hat\beta_2 = \frac{m_{v_1y}}{m_{yy}} \qquad s_1 = v_1 - \hat\beta_2 y \qquad \hat\gamma_2 = \frac{m_{s_1z}}{m_{zz}} \qquad w_2 = s_1 - \hat\gamma_2 z \qquad \hat\alpha_3 = \hat\alpha_2 - \hat A_2$$

and so on.

The method of stepwise least squares does indeed yield better estimates in each round. To see this, consider the steps

$$\hat\alpha_1 = \frac{m_{wx}}{m_{xx}} = \alpha + \frac{m_{(\beta y + \gamma z + u)x}}{m_{xx}} = \alpha + A_1$$

$$v = w - \hat\alpha_1 x = \alpha x + \beta y + \gamma z + u - \alpha x - A_1 x = \beta y + \gamma z + u - A_1 x$$

$$\hat\beta_1 = \frac{m_{vy}}{m_{yy}} = \beta + \frac{m_{(\gamma z + u - A_1 x)y}}{m_{yy}} = \beta + B_1$$

$$s = v - \hat\beta_1 y = \beta y + \gamma z + u - A_1 x - \beta y - B_1 y = \gamma z + u - A_1 x - B_1 y$$

$$\hat\gamma_1 = \frac{m_{sz}}{m_{zz}} = \gamma + \frac{m_{(u - A_1 x - B_1 y)z}}{m_{zz}} = \gamma + C_1$$

$$w_1 = s - \hat\gamma_1 z = \gamma z + u - A_1 x - B_1 y - \gamma z - C_1 z = u - A_1 x - B_1 y - C_1 z$$

$$\hat A_1 = -\frac{m_{w_1x}}{m_{xx}}$$

$$\hat\alpha_2 = \hat\alpha_1 - \hat A_1 = \alpha + A_1 - A_1 + \frac{m_{(u - B_1 y - C_1 z)x}}{m_{xx}} = \alpha + A_2$$

The residual factor $A_2$ is of smaller order of magnitude than $A_1$, and so $\hat\alpha_2$ is better than $\hat\alpha_1$ as an estimate of $\alpha$.
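Steps 2 to 6 can be traced numerically. The following is a minimal sketch under stated assumptions: the six observations are invented and already measured from their means, the disturbance is set to zero, the moments are plain sums of products (so $m_{xx} = m_{yy} = m_{zz} = 8$ here rather than 1), and the helper names `moment`, `minus`, and `stepwise_round` are hypothetical.

```python
def moment(p, q):
    """Moment of p on q for variables already measured from their means."""
    return sum(a * b for a, b in zip(p, q))

def minus(p, coef, q):
    """Return the new variable p - coef*q, observation by observation."""
    return [a - coef * b for a, b in zip(p, q)]

def stepwise_round(w, x, y, z, alpha):
    """Steps 3 to 5, starting from the current approximation to alpha."""
    v = minus(w, alpha, x)                    # step 3: v = w - alpha*x
    beta = moment(v, y) / moment(y, y)
    s = minus(v, beta, y)                     # step 4: s = v - beta*y
    gamma = moment(s, z) / moment(z, z)
    w_next = minus(s, gamma, z)               # step 5: w1 = s - gamma*z
    a_hat = -moment(w_next, x) / moment(x, x) # estimated error of alpha
    return alpha - a_hat, beta, gamma

# Invented exogenous variables, each summing to zero and only slightly
# intercorrelated: m_xy = 1, m_xz = -2, m_yz = 1 against m_xx = 8.
x = [2, -1, -1, 1, -1, 0]
y = [0, 1, -2, 1, 1, -1]
z = [1, 1, 0, -2, 1, -1]

# Invented true parameters with |alpha| > |beta| > |gamma|, and u = 0.
ALPHA, BETA, GAMMA = 3.0, 2.0, -1.0
w = [ALPHA * a + BETA * b + GAMMA * c for a, b, c in zip(x, y, z)]

alpha_1 = moment(w, x) / moment(x, x)                           # step 2
alpha_2, beta_1, gamma_1 = stepwise_round(w, x, y, z, alpha_1)  # steps 3 to 5
alpha_3, beta_2, gamma_2 = stepwise_round(w, x, y, z, alpha_2)  # step 6

print(alpha_1, alpha_2, alpha_3)
```

In this invented sample the successive approximations to $\alpha$ are 3.5, about 3.06, and about 3.02, against the value 3 used to build the data. Each round restarts from the original variable $w$, as step 6 requires, rather than from the residual variable of the previous round.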
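To see this consider

$$A_1 = \frac{m_{(\beta y + \gamma z + u)x}}{m_{xx}} \qquad A_2 = \frac{m_{(u - B_1 y - C_1 z)x}}{m_{xx}}$$

Expressing $B_1$ and $C_1$ in terms of $A_1$,

$$A_2 = A_1\left[\frac{m_{xy}m_{xy}}{m_{xx}m_{yy}} + \frac{m_{xz}m_{xz}}{m_{xx}m_{zz}} - \frac{m_{xy}m_{yz}m_{zx}}{m_{xx}m_{yy}m_{zz}}\right] + \gamma\left[-\frac{m_{yx}m_{yz}}{m_{xx}m_{yy}} + \frac{m_{zx}m_{yz}m_{yz}}{m_{xx}m_{yy}m_{zz}}\right] + \left[\frac{m_{ux}}{m_{xx}} - \frac{m_{yx}m_{uy}}{m_{xx}m_{yy}} - \frac{m_{zx}m_{uz}}{m_{xx}m_{zz}} + \frac{m_{uy}m_{zx}m_{yz}}{m_{xx}m_{yy}m_{zz}}\right]$$

Each bracketed term is of small order of magnitude. So, unless $\gamma$ is numerically very large (which was guarded against in step 1), it follows that $\hat\alpha_2$ is an improvement over $\hat\alpha_1$.

The same can be shown for $\hat\beta_2$, $\hat\gamma_2$, and $\hat\alpha_3$, compared with $\hat\beta_1$, $\hat\gamma_1$, and $\hat\alpha_2$, respectively.

The method of stepwise least squares can also be used when $x$, $y$, and $z$ are endogenous variables. In this case, although the bracketed terms are not negligible, they keep decreasing in successive rounds of the procedure. Had another variable been treated as the independent one, the stepwise method (like any procedure based on naive least squares) would, in general, have given another set of results.

APPENDIX C

Subsample variances as estimators

Consider the model $y = \alpha + \gamma z + u$, under all the Simplifying Assumptions. Let

$$\hat\gamma = \frac{m_{zy}}{m_{zz}}$$

be the maximum likelihood (and also the least squares) estimate of $\gamma$ based on a sample of size $S$. Its variance is

$$\operatorname{cov}(\hat\gamma,\hat\gamma \mid S) = \mathcal E(\hat\gamma - \mathcal E\hat\gamma)^2 = \mathcal E(\hat\gamma - \gamma)^2 = \mathcal E\left(\frac{m_{zu}}{m_{zz}}\right)^2 = \mathcal E\left[\frac{(u_1z_1 + \cdots + u_Sz_S)^2}{(z_1^2 + \cdots + z_S^2)^2}\right]$$

Holding $z_1, \ldots, z_S$ fixed, this reduces to

$$\sigma_{uu}\,\frac{z_1^2 + \cdots + z_S^2}{(z_1^2 + \cdots + z_S^2)^2} = \frac{\sigma_{uu}}{z_1^2 + \cdots + z_S^2} = \frac{\sigma_{uu}}{m_{zz}}$$

Let us now ask what happens on the average if from the original sample we obtain its $S$ subsamples (each of size $S - 1$), if we compute the corresponding parameters $\hat\gamma(1), \ldots, \hat\gamma(S)$, and if we then compute the sample variance $V$ of these $\hat\gamma$'s. Writing $B_i = \hat\gamma(i) - \gamma$ for the error of the $i$th subsample estimate,

$$SV = \sum_i \left[\hat\gamma(i) - \operatorname{av}\hat\gamma(\cdot)\right]^2 = \sum_i (B_i)^2 - \frac{1}{S}\Big(\sum_i B_i\Big)^2$$

$$\mathcal E(SV) = S\,\mathcal E V = \mathcal E\sum_i (B_i)^2 - \frac{1}{S}\,\mathcal E\Big(\sum_i B_i\Big)^2$$

For our fixed constellation of values $(z_1, \ldots, z_S)$ of the exogenous variable, any terms of the form $\mathcal E(u_iu_j)$ $(i \neq j)$ equal zero. By careful manipulation we obtain

$$S\,\mathcal E V = \sigma_{uu}\left[\frac{S-1}{S}\sum_i \frac{1}{m_{zz} - z_i^2} - \frac{2}{S}\sum_{i<j}\frac{m_{zz} - z_i^2 - z_j^2}{(m_{zz} - z_i^2)(m_{zz} - z_j^2)}\right]$$

So far this is an exact result. If we make the further assumption that the values $z_1, \ldots$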
$, z_S$ of the exogenous variable are spread "typically," the relations

$$m_{zz} - z_i^2 = \frac{S-1}{S}\,m_{zz} \qquad m_{zz} - z_i^2 - z_j^2 = \frac{S-2}{S}\,m_{zz}$$

are approximately true, or at least become so in the course of the summations given in the last brackets. Therefore,

$$S\,\mathcal E V \approx \frac{S}{S-1}\,\frac{\sigma_{uu}}{m_{zz}}$$

It follows that $(S - 1)V$ is an unbiased estimator of $\operatorname{cov}(\hat\gamma,\hat\gamma \mid S)$.

APPENDIX D

Proof of least squares bias in models of decay

Let the variable $\delta_t$ be equal to 1 if time period $t$ is in the sample, and zero otherwise. The least squares estimate of $\gamma$ is

$$\hat\gamma = \frac{\sum_t \delta_t\,y_t\,y_{t-1}}{\sum_t \delta_t\,y_{t-1}^2}$$

The proposition $\mathcal E(\hat\gamma) < \gamma$ will be proved by mathematical induction. It will be shown true for an arbitrary sample of $S = 2$ points; then its truth for $S + 1$ will be shown to follow from its truth for any $S$.

Definition. A conjugate set of samples contains all samples having the following properties:

1. The samples include the same time periods and skip the same time periods (if any); let $t_1$ be the first period included and $t_S$ the last.
2. If time period $t$ is included in the samples, then all samples of the conjugate set have disturbances $u_t$ of the same absolute value. The disturbances do not have to be constant from period to period.
3. When a time period is skipped by the sample, algebraically equal disturbances must have operated on the model during the skipped periods.
4. The samples have come from a universe having, as of $t_1$, the same values for all predetermined variables.

Consider an arbitrary sample of two points. Let one point come from period $j$ and the other from period $k$ $(k > j)$; the sample can be completely described by the disturbances operating at and between these two time periods; that is,

$$S_1 = (+u_j, u_{j+1}, \ldots, u_{k-1}, +u_k)$$

$S_1$ has three conjugates:

$$S_2 = (+u_j, u_{j+1}, \ldots, u_{k-1}, -u_k)$$
$$S_3 = (-u_j, u_{j+1}, \ldots, u_{k-1}, +u_k)$$
$$S_4 = (-u_j, u_{j+1}, \ldots, u_{k-1}, -u_k)$$

Denote the four corresponding least squares estimates of $\gamma$ by the symbols $\hat\gamma(++)$, $\hat\gamma(+-)$, $\hat\gamma(-+)$, and $\hat\gamma(--)$.
By definition, each of the four conjugate samples has inherited from the past the same value $y_{j-1}$ of lagged consumption. In period $j$ the random disturbance $u_j$ operates positively for samples $S_1$ and $S_2$ and negatively for $S_3$ and $S_4$. Therefore, in the next period $S_1$ and $S_2$ inherit one value $p = \gamma y_{j-1} + u_j$ for lagged consumption, and samples $S_3$ and $S_4$ inherit another value $n = \gamma y_{j-1} - u_j$. By the definition of conjugates, in periods $j + 1, j + 2, \ldots, k - 1$, equal random disturbances affect all samples $S_1$ to $S_4$. Moreover, in period $k$, samples $S_1$ and $S_2$ receive an equal inheritance of lagged consumption from the past. Call it $y_p$. Its exact value can be obtained by applying model (3-2) (see Sec. 3.2) to $p$ successively enough times, but this value is of no interest. Samples $S_3$ and $S_4$ each get the inheritance $y_n$, which likewise arises from the application of model (3-2) to $n$. The two inheritances are different: $y_p > y_n$, since $p > n$.

When we come to period $k$, samples $S_1$ and $S_2$ part company, because the first receives a boost $+u_k$, and the second receives the opposite $-u_k$. For the same reason $S_3$ parts company with $S_4$. Define

$$q = \gamma y_p + u_k \qquad r = \gamma y_p - u_k \qquad v = \gamma y_n + u_k \qquad w = \gamma y_n - u_k$$

The four conjugate estimates are

$$\hat\gamma(++) = \frac{p\,y_{j-1} + q\,y_p}{y_{j-1}^2 + y_p^2} \qquad \hat\gamma(+-) = \frac{p\,y_{j-1} + r\,y_p}{y_{j-1}^2 + y_p^2}$$

$$\hat\gamma(-+) = \frac{n\,y_{j-1} + v\,y_n}{y_{j-1}^2 + y_n^2} \qquad \hat\gamma(--) = \frac{n\,y_{j-1} + w\,y_n}{y_{j-1}^2 + y_n^2}$$

Symbolize the sum of these four estimates by $\sum \hat\gamma(\pm\pm)$ or $\sum_{\operatorname{conj} S_1} \hat\gamma$.

$$\sum \hat\gamma(\pm\pm) = \frac{2p\,y_{j-1} + (q + r)\,y_p}{y_{j-1}^2 + y_p^2} + \frac{2n\,y_{j-1} + (v + w)\,y_n}{y_{j-1}^2 + y_n^2} = 4\gamma + \text{residual}$$

where the residual equals $2u_j\,y_{j-1}\left[\dfrac{1}{y_{j-1}^2 + y_p^2} - \dfrac{1}{y_{j-1}^2 + y_n^2}\right]$.

The residual is always negative, because $y_p^2 > y_n^2$ if $y_{j-1} > 0$, and $y_p^2 < y_n^2$ if $y_{j-1} < 0$. Therefore the average estimate $\hat\gamma$ from this conjugate set of samples is less than the true $\gamma$.

Consider an arbitrary sample of size $S + 1$, which I call sample $B(+)$. Let it contain observations from time periods $j_1, j_2, \ldots, j_S, j_{S+1}$ (which need not be consecutive).
$B(+)$ can be completely described by the disturbances that generated it plus the predetermined condition $y_{j_1-1}$.

$$B(+) = (y_{j_1-1};\ u_{j_1}, u_{j_2}, \ldots, u_{j_S}, u_{j_{S+1}})$$

Now consider another sample $A$ which contains one time period (the last one) less than $B(+)$ but which is in all other respects the same as $B(+)$:

$$A = (y_{j_1-1};\ u_{j_1}, u_{j_2}, \ldots, u_{j_S})$$

The conjugate set of $A$ can be described briefly by

$$\operatorname{conj} A = (y_{j_1-1};\ \pm u_{j_1}, \pm u_{j_2}, \ldots, \pm u_{j_S})$$

The conjugate set of $B(+)$ has twice as many samples as $\operatorname{conj} A$; the elements of $\operatorname{conj} B(+)$ can be constructed from elements of $\operatorname{conj} A$ by adding another period in which the disturbance $u_{j_{S+1}}$ shows up once with a plus sign and once with a minus sign.

Define $B(-)$ as the sample consistent with predetermined condition $y_{j_1-1}$ and containing all the disturbances of sample $B(+)$ identically, except the last, which takes the opposite sign. Therefore, if

$$B(+) = (y_{j_1-1};\ u_{j_1}, u_{j_2}, \ldots, u_{j_S}, +u_{j_{S+1}})$$

then

$$B(-) = (y_{j_1-1};\ u_{j_1}, u_{j_2}, \ldots, u_{j_S}, -u_{j_{S+1}})$$

Assume that the estimates $\hat\gamma$ derived from $\operatorname{conj} A$ average less than the true $\gamma$ $(0 < \gamma < 1)$. Symbolize this statement as follows:

$$\sum \hat\gamma(\pm \pm \cdots \pm) < 2^S \gamma$$

Each $\hat\gamma$ in the above sum is a fraction of the form

$$\frac{\sum_t \delta_t\,y_t\,y_{t-1}}{\sum_t \delta_t\,y_{t-1}^2} \qquad (1)$$

Let $\hat\gamma(A)$ stand for the estimate derived from sample $A$. Then $\hat\gamma(A)$ can be expressed as a quotient $N/D$ of two specific sums $N$ and $D$, where $D$ is positive.

Each sample from the set $\operatorname{conj} B$ gives rise to an estimate of $\gamma$. The formulas now are fractions like (1), but the sums have one more term in the numerator and one more in the denominator, because one more period is involved. Thus, writing $y'$ for the lagged value $y_{j_{S+1}-1}$,

$$\hat\gamma[B(+)] = \frac{N + y'(\gamma y' + u_{j_{S+1}})}{D + y'^2} = \frac{N + \gamma y'^2 + y'u_{j_{S+1}}}{D + y'^2}$$

Similarly,

$$\hat\gamma[B(-)] = \frac{N + \gamma y'^2 - y'u_{j_{S+1}}}{D + y'^2}$$

It follows that

$$\hat\gamma[B(+)] + \hat\gamma[B(-)] = 2\,\frac{N + \gamma y'^2}{D + y'^2}$$

If $\hat\gamma(A) > \gamma$, then the last fraction is less than $\hat\gamma(A)$; if $\hat\gamma(A) = \gamma$, then the fraction equals $\gamma$; if $\hat\gamma(A) < \gamma$, then the fraction is less than $\gamma$. Exactly the same is true for all samples in the conjugate set of $A$.
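Therefore,

$$\sum_{\operatorname{conj} B} \hat\gamma < 2\sum_{\operatorname{conj} A} \hat\gamma < 2^{S+1}\gamma$$

which completes the proof.

I shall not discuss what happens if the value of $\gamma$ does not lie between 0 and 1, because no new principle or new difficulty arises. As an exercise, find the bias of $\hat\gamma$ if $\gamma$ lies between 0 and $-1$.

APPENDIX E

Completeness and stochastic independence

The proof that $\mathcal E(u_g, u_p) = 0$ implies $\det \mathbf B \neq 0$ is by contradiction. If $\det \mathbf B = 0$, then a nontrivial linear combination of $\mathbf B$'s rows is $\mathbf 0$ (the zero vector):

$$L = \lambda_1 \beta_1 + \cdots + \lambda_G \beta_G = \mathbf 0 \qquad (1)$$

Hence,

$$\lambda_1 \beta_1 \mathbf y + \cdots + \lambda_G \beta_G \mathbf y = 0$$

But, by the model $\mathbf B\mathbf y + \boldsymbol\Gamma\mathbf z = \mathbf u$, we also have $\beta_g \mathbf y + \gamma_g \mathbf z = u_g$ for every equation $g$. So

$$\lambda_1 u_1 + \cdots + \lambda_G u_G = (\lambda_1 \gamma_1 + \cdots + \lambda_G \gamma_G)\mathbf z = Z \qquad (2)$$

Since $Z$ is a constant number for any constellation of values of the exogenous variables, we have in equation (2) a nontrivial linear relation among the disturbances $u_1, \ldots, u_G$. This contradicts the premise that they are independent.

This argument shows that $\mathcal E(u_g, u_p) = 0$ if and only if $\det \mathbf B \neq 0$.

APPENDIX F

The asterisk notation

A single star (*) means presence in a given equation of the variable starred; a double star (**) means absence. Accordingly, in the model

$$y_1 + \gamma_{11}z_1 + \gamma_{12}z_2 + \gamma_{13}z_3 + \gamma_{14}z_4 = u_1 \qquad (1)$$
$$\beta_{21}y_1 + y_2 + \beta_{23}y_3 + \gamma_{21}z_1 + \gamma_{22}z_2 + \gamma_{23}z_3 = u_2 \qquad (2)$$
$$\beta_{31}y_1 + y_3 + \gamma_{31}z_1 + \gamma_{32}z_2 = u_3 \qquad (3)$$

with reference to the third equation, $\mathbf y^*$ means the vector of the endogenous variables present in the third equation, namely, $\operatorname{vec}(y_1, y_3)$.

$$\mathbf y^{**} = \operatorname{vec}(y_2) \qquad \mathbf z^* = \operatorname{vec}(z_1, z_2) \qquad \mathbf z^{**} = \operatorname{vec}(z_3, z_4)$$

For the first equation,

$$\mathbf y^* = \operatorname{vec}(y_1) \qquad \mathbf y^{**} = \operatorname{vec}(y_2, y_3) \qquad \mathbf z^* = \operatorname{vec}(z_1, z_2, z_3, z_4)$$

Stars (single or double) may also be placed on the symbol $\mathbf x$, which stands for all variables, endogenous or exogenous. Thus for the second equation,

$$\mathbf x^* = \operatorname{vec}(y_1, y_2, y_3, z_1, z_2, z_3) \qquad \mathbf x^{**} = \operatorname{vec}(z_4)$$

$G^*$ is the number of $y$'s present in the $g$th equation; $G^{**}$ is the number absent. $H^*$ is the number of $z$'s present; $H^{**}$ is the number absent. Examples:

$$G_1^* = 1 \qquad H_1^* = 4 \qquad G_2^* = H_2^* = 3 \qquad G_3^{**} = 1 \qquad H_3^{**} = 2$$

$\alpha_g^*$, $\beta_g^*$, $\gamma_g^*$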
are vectors made up of the nonzero parameters of the ^th equation in their natural order, a here serves as a general symbol, like x, for all parameters, or 7. Examples: a? = vec (1,711,712,713,714) Bf - vec (1) 7? = vec (711,712,713,714) 7? = vec (721,722,723) aj = vec (031,1,731,732) yJ = vec (731,732) In Chap. 8, 1 place stars (or pairs of stars) not on vectors but on the variables themselves to emphasize their presence in (or absence from) an equation. For instance, in discussing the third equation above, we may write y* , t/**, y*, z*, **i z * *> z ** to stress that y h y if f i, z% do appear in the third equation whereas the other variables 2/1, z%, z K do not. Finally, AJ* means the matrix that can be formed from the elements of A by taking only the columns of A that correspond to the variables x** absent from the gth. equation. For example, ** _ A? The columns of A** correspond to x** = vec (2/2,^3). "0 0" 1 023 .0 1 . AJ* corresponding to x** « vec (y^z^zi). 713 714 1 723 Index Aggregation, 5 Allen, R. G. D., xvii, 51 Assumptions, additivity, error term, 5, 18,22 Simplifying, 9-17 statistical, 4 stochastic, 4, 6, 95 (See also Error term) Autocorrelation (see Serial correlation) Autorcgression, 52-53, 61, 168, 195 of the economy, over-all, 185-186, 195-196 Bartlett, M. S., 51 Beach, E. F., xvii, 21, 146, 155 Bennion, E. G., 134 Bias, 35-39 instrumental variables, 114 least squares, 38, 47, 68, 72 in models of dccnv, 52-62, 209-212 secular consumption function, 71 simultaneous interdependence, 65-68 (See also Unbiascdncss) Bogus relations, 89-90 Bronfenbrenner, Jean, 72 Bunch map analysis, 146-l50j 105 Burns, A. F., 195 Business cycle model, 171-1T4, 183-184 Causality, 63-64, 106, 108, 112-113, 119-121 Causation, chain of, 119-121 Central limit theorem, 14 Characteristic vector (eigenvector), 124 Christ, C. 
F., 21, 134 Cobweb phenomena, 16 Cochrane, D., 195 Completeness, 86, 106, 213 (See also Nonsingularity) Computations, 32, 197-202 Conflicting observations, 109 Confluence, 103-106, 155 linear, 142-144 Consistency, 24, 36, 43-46, 51, 118 in instrumental variables, 114 in least squares, 44, 47-49, 195 in limited information, 118 and maximum likelihood, 44, 47-49 217 218 Consistency, notation for, 44 in reduced form, 100, 103 in Thcil's method, 118, 128 Consumption function, cross section versus time series, 192-196 examples, 2, 63-72, 137-139 secular, bias in, 71 Correcting denominators, 180-181 Correlation, 141-142 partial, 144-145 serial, 168-176, 195 spurious, 150 Correlation coefficient, matrix, 145 partial, 144-145 Bamplc, 141-142, 155 computation of, 200-201 universe, 141-142, 155 Correlograms, 174-175, 195 Cost function, 21-22 Counting rules for identification, 92-96, 102 Courant, R., 84 Co variance, 19, 27, 82-83, 170, 207-208 error term, 30, 49 population and sample, 141 Cramer's rule, 35, 203 Credence, 2^, 4i-4L», 110 Cross-section versus time-series esti- mates, 192-196 (See also Consumption function) Cyclical form, 121 (See also Recursive models) Cyclical indicators, 182 (See also Business cycle model) David, F. 
N., 195 Dean, J., 22 Decay, models of, initial conditions in, 53, 68-61 least squares bias in, 52-62, 171, 209-212 unbiased estimation in, 60-61 Degrees of freedom, 3 Determinants, 34-35 Cramer's rule, 35 Jacobians, 29, 73-74, 79-82 Diffusion index, 182, 188-190, 196 and statistical coverage, 191-192 INDEX Discontinuity, of hypotheses, 137-138 probability, 9-10 Disturbances (see Error term) Dummy variables, 140, 156-157, 165 Efficiency, 24, 47, 51, 118 in instrumental variables, 118 in least squares, 47-49 and heteroskedasticity, 48 in limited information, 118 and maximum likelihood, 47-49 in Theirs method, 118 Eigenvector, 124 Elasticities, price, and Haavelmo's proposition, 72 Equations, autorcgrcssivc, 52-53 simultaneous, 67-71, 73-84 notation, 74-76 structural, bogus, 89-90 Error term, 4, 8, 9-18 additivity assumption, 5, 18, 22 covariance matrix, 30, 49 Simplifying Assumptions, constancy of variance (no. 3), 13, 17, 77 (See also Heteroskedasticity) normally distributed (no. 4), 14, 17, 77 random real variable (no. 1), 9, 17, 77 serial independence of (no. 5), 16, 18,77 uncorrelated, in multi-equation models (no. 7), 78-79, 87, 213 with predetermined variable (no. 6), 16, 18, 33, 52, 53, 65n. zero expected value of (no. 2), 10, 17, 77 Errors, of econometric relationships, 64-76 of measurement, 6, 48 in variables, 155 Estimate, variance of, 39-43 Estimates in multi-equation models, in- terdependence of, 82 Estimating criteria, 6, 8, 23, 47 (See also Consistency; Maximum likelihood; Unbiascdncss) INDEX Estimation, 1-2, 8, 108-110 simultaneous, 67-68, 126-135 unbiased, in models of decay, 60-61 Estimators, extraneous, 134 subsample variances as, 207-208 Expectation, 18-20 Expected value, 10-11, 19-21 Factor analysis, linear orthogonal, 160- 164 unspecified, 156-165 versus variance analysis, 164-165 Fisher, R. A., 51 Forecasts, criteria for, 132-134 leading indicators as, 186-188 Fox, K. A., 21, 134 Friedman, M., 72, 134 Frisch, R., 155 Gartaganis, A. 
J., 195 Gini, C., 51 Goldberger, A. S., x, 21 Goulden, C. H., 155 Haavelmo, T., 71, 106, 155, 196 Haavelmo proposition, 64-66, 71-72, 125 Heteroskedasticity, 48-49 efficiency, 48 (See also Error term, Simplifying Assumptions, no. 3) Hogben, L., 12n., 46, 51 Homoskcdastic equation, 50 (Sec also Heteroskedasticity) Hood, W. C, xvii, 21, 85, 125 Hurwicz, L., 22, 53, 62 Hypotheses, choice of, 154-155 discontinuous, K>7-138 maintained, 130-137 null, 138 questioned, 137 testing of, 136-155 Identification, 3-4, 64, 85-106 counting rules, 92-96, 102 219 Identification, exact, 88, 9£Htt, 94, 107 absence of, 91, 128-131 of parameters in underidcfUified equation, 96-97, 128 (See also Overidentification ; Under- identification) Improvement, 44 Incomplete theory, 5 Independence of simultaneous equa- tions, stochastic, 78-79, 8C-87, 213 Initial conditions in models of decay, 53, 58-61 Instrumental variable technique, 107- 117 properties, 114 related, to limited information, 118, 125 to reduced form, 113 Instrumental variables, efficiency in, 118 weighted, 116-117 Interdependence, simultaneous, 63-72 Jacobians, digression on, 79-82 of likelihood function, 29 references, 84 and simultaneous equations, 73-74 Jaffd, W., x Jeffreys, H., 51 Kaplan, W., 84 Kendall, M. G., xvii, 21, 50, 51, 165, 173n., 195 Keynes, J. M., 51 Klein, L. R., xvii, 21, 51, 84, 106, 124n., 125, 134, 155, 201 Koizumi, S., 21 Koopmans, T. C., xvii, 22, 72, 84, 85, 94n., 106, 155, 195 Kuh, E.,134 Lacunes (missing data), 159n. 
Lagged model, 53-54, 62 (See also Recursive models) Lango, G., 22 220 INDEX Leading indicators, 186-188 Least squares, 23-50 bias, in models of decay, 52-62, 209- 212 consistency, 44, 47-49, 195 diagonal, and simultaneous estima- tion, 07-68 directional, 69, 125 efficiency, 47-49 and hotoroskodastioity, 48 generalized, 33-35 Haavclmo bias, 65-67 justification, 31-32, 134-135 maximum likelihood, 24, 31-34, 47- 49, 57 naive, compared to maximum likeli- hood, 67, 82-83 reduced form, 88, 98, 113, 127, 133 references, 50-51, 134-135 related to instrumental variables, 113-114 relation to limited information, 125 simultaneous estimation, 67-70 stepwise, 203-206 sufficiency, 47-49 unbiasedness, 35-39, 47-49, 60, 65- 67 used in estimating reduced form, 88 Loser, C. E. V., 157 Likelihood, 24-25, 50 Likelihood function, 23, 28-31 and identification, 90-91, 100-102 (See also Maximum likelihood) Limited information, 4, 118-125, 128 consistency in, 118 efficiency in, 118 formulas for, 123 relation of, to indirect least squares, 125 to instrumental variables, 118, 125 Linear confluence, 142-144 Linear models, multi-equation, nota- tion, 74-77 versus ratio models, 152-153 Linearity, testing for, 150-152 Lh, T. C, 130-132 Marginal propensity to consume (see Consumption function) MarkofT theorem, 195 Marschak, J., 21 Matrix, 3, 27 coefficients, 75 inversion, 201-202 moments, 31, 124, 198-201 nonsinguln/ity of, 86-87 orthogonality, 100-164 rank and identification, 93-94 triangular (recursive models), 83 Maximum likelihood, 23, 25, 29-33, 50, 78-84 computation, 83 consistency, 44-48 efficiency, 47-49 full information, 73-84, 118, 129, 133 identification, 100-103 interdependence, 82-83 simultaneous, 63-71 limited information, 118-125 subsample variances, 207 value for forecasting, 132-133 Mean, arithmetic, expected value, 10-11 Measurement, errors of, 6, 48 Meyer, J. It., 134 Miller, II. L., Jr., 134 Mitchell, W. 
C, 195 Model, 1 business cycle, 171-174, 183-184 lagged, 53-54, 62 latent, 110-112 manifest, 110-112 recursive, 83-84 supply-demand, 89 (See also Linear models) Model specification (see Specification) Moments, 3, 20, 32-35 algebra of, 51 computation of, 32-33, 197-200 determinants of, 34-35 expectation of, 20 matrices of, 34, 124 raw, 198 from samplo means, 18-20 simple (augmented), 20, 198 Moore, Y. II., 134, 187n., 196 Moving average technique, 61, 168 INDEX Moving average technique, Slutsky proposition, 173-174 Multicollinearity, 3, 103-106 etymology, 105-106 (See also Confluence; Undcridentifi- cation) National Bureau of Economic Research, 181-182, 186-188, 195 Nature's Urn, 11-13, 168 subjective belief and, 11-12 Ncyman, J., 195 Nonsingularity, 86-87 (See also Completeness) Normal distribution, multivariate, 26- 27 univariate, 14—15 Observations, 7 conflicting, 109 (See also Sample) Orcutt, G. H., 72, 185, 195 Original form, 87-88 (See also Reduced form) Orthogonality, 160-164 test of, 163-164 and variance analysis, 165 Oscillations, 53, 172-174 Overdetcrminacy, 88-89 (See also Ovcridentification) Ovcridentification, 86, 89, 98-103, 129 contrasted with underidentification, 100-103 Overidentified models, ambiguity in, 98 P lim (probability limit), 44 Parameter estimates, constraints on, 94-95 Parameter space and identification, 100-102 Parameters, 2, 8 a priori constraints on, 91-94 Pareto index, 194 Pearson, K., 51 Periodograms, 175-176, 195 Population, 7, 11-12, 141 221 Prediction, 2 (See also Forecasts; Leading indica- tors) Probability, 3, 9-10, 24 density, 9 inverse, and maximum likelihood, 51 Probability limit (P lim), 44 Production function, 21-22 Cobb-Douglas, 157-159 Quandt, R. 
E., x Random term (see Error term) Random variable, 9 Rank, 93-94 Ratio models versus linear models, 152- 153 Recursive models, 83-84, 121 Reduced form, 87-88 bias, 100-103 dual, Theil'a method, 126-128 and instrumental variable technique, 113 least-squares forecasts, 133 Residual contrasted with error, 8 Sample, 6-7, 11-12, 141-142 and consistency. 43-46 size, 53 and unbiasedness, 35-43 variance, 39-43 Samples, conjugate, 52, 54-57 Sampling procedure, 6-21 Scaling of variables, 198 Schultz, H., 22, 9Qn. Seasonal variation, 176-178 removal of, 182 Sectors, independence and nonsingular- ity, 87 split, versus sector variable, 140, 153- 154 Serial correlation, 168-171, 175, 195 in Simplifying Assumptions, 16, 18, 77-78 Sets, conjugate, 55-57 of measure zero, 111 222 INDEX Sbiskin, J., 182n. Simon, H. A., 106 Simplifying Assumptions, 9-18 generalization to many-equation models, 77-78 mathematical statement, 17-18 sufficient for least squares, 32n. (Sec also Error term) Simultaneous estimation, 67^68, 126- 135 Slutsky, E., 173-175, 195 Solow, R. M., 165 Specification, 1 and identification, 85 imperfect, 5 stochastic, 6 structural, 6 Statistical assumptions, 4 (See also Stochastic assumptions) Stephan, F. F., x Stochastic assumptions, 4, 6, 95 (See also Error terms, Simplifying Assumptions) Stochastic independence, 78-79, 87, 213 Subjective belief, 50 and Nature's Urn, 11-12 Subsamplc variances as unbiased esti- mators, 207-208 Sufficiency, 47, 51 Suits, D. 
B., 21 Supply-demand model, 89 Symbols, special, bogus relations, superscript ©, 89 estimates, maximum likelihood, as in â (hat), 8, 67 naive least squares, as in ǎ (bird), 8, 67 other kinds, unspecified, as in ã (wiggle), 8 matrices, boldface type, 33 variables, absent from given equation, superscript (**), 91 present in given equation, superscript (*), 91 Symmetrical distribution, 61 Testing of hypotheses, 136-155 Theil, H., 116, 126, 134 Theil's method, 126-128 consistency in, 118, 128 efficiency in, 118 stepwise least-squares computation, 203-206 Time series, 166-196 cyclical, 172 generated by structural models, 183-184 long-term, abuse of, 190 National Bureau analysis, 181-182, 186-188, 195 random series and Slutsky proposition, 173 Time-series estimates versus cross-section estimates, 192-195 Timing indicator, 186-188 Tinbergen, J., xvii, 21, 134, 155, 185, 195 Tintner, G., xvii, 21, 51, 62, 103 Trend removal, 178-181 Unbiasedness, 24, 35-37, 45-46, 51, 118 in instrumental variables, 114 in least squares, 35-39, 47-49, 60, 65-67 in limited information, 118 reasons for, 53 in reduced form, 100, 103 in subsample variance estimators, 207-208 in Theil's method, 128 (See also Bias) Uncertainty Economics, 45 Underdeterminacy, 88-89 Underidentification, 86, 88-91, 96, 100-106, 128 of all structural relationships, 130 Universe, 11-12, 141-142 Unspecified factors, 156-165 (See also Sectors, split) Urn of Nature (see Nature's Urn) Valavanis, S., 22, 196 Variables, endogenous, 2, 64 errors in (see Errors in variables) exogenous, 2, 64 independent (exogenous), 2, 64 scaling, 198 standardized, 145
Variance, 19, 40-41 covariance, 27-31, 82-83, 140-141, 170 of estimate, 39-43 estimates of, 41 infinite, 14 subsample as estimator, 207-208 Variance analysis and factor analysis, 164-165 Variate difference method, 21, 179-181 Verification, 2, 136-155 Watts, H. W., 165 Weights, arbitrary, 50 instrumental variables, 109-111, 116-117 and least squares, 109-110, 124 Wilson, E. B., 195 Wise, J., 195 Wold, H., 135 Working, E. J., 106 Yule, G. U., 51 Zero restriction, 92